您好,欢迎光临本网站![请登录][注册会员]  
文件名称: ZooKeeper’s atomic broadcast protocol: Theory and practice
  所属分类: 其它
  开发工具:
  文件大小: 382kb
  下载次数: 0
  上传时间: 2019-10-14
  提 供 者: u0100*****
 详细说明:zookeeper 系统中的主要分布式协调算法,有助于深入理解zkThe original Paxos protocol does not enable multiple outstanding transactions Paxos does not require FIFO channels for communication, so it tolerates message loss and reordering. If two outstanding transactions have an order dependency, then Pa cannot have multiple outstanding transactions because FIFO order is not guaranteed This problem could be solved by batching multiple transactions into a single proposal and allowing at most one proposal at a time, but this has performance draw backs The manipulation of the sequence of transactions to use during recovery from pri mary crashes is claimed to not be efficient enough in Paxos 12. Zab improves this aspect by employing a transaction identification scheme to totally order the transac tions. Under the scheme, in order to update the application state of a new primary process, it is sufficient to inspect the highest transaction identifier from each process and to copy transactions only from the process that accepted the transaction with the highest identifier. In Paxos, the same idea cannot be applied on sequence numbers so a new primary has to execute Phase l of Paxos for all previous sequence numbers for which the primary has not "learned a value(in Zab terminology, " committed a transaction”) Additional performance requirements [19 for ZooKeeper are: (i) low latency. (ii) good throughput under bursty conditions, handling situations when write workloads Increase ra pidly during e g, massive system reconfiguration, and (iii)smooth failure handlin, so that the service can stay up when some non-leader server crashes 2.2 Crash-recovery system model ZooKeeper assumes the crash-recovery model as system model 12. The system a set, of processes ii=p1, p2,...Px, also referred to as peers in this report, that communicate by message passing, are each equipped with a stable storage device, and may crash and recover indefinitely many times. a quorum of II is a subset Q c I such that Q>N 2. Any two quorums have a non-empty intersection. Processes have two states: up and down. a process is down from a crash time point to the beginning of its recovery, and up from the beginning of a recovery until the next crash happens There is a. bidirectional channel for every pair of processes in IL, which is expected to satisfy the following properties: (i) integrity, asserting that process p, receives a message m, from pi only if pi has sent m; and (ii) prefix, stating that if process p receives a message m and there is a message m' that precedes m in the sequence of messages pi sent to pi, then pi receives m before m. To achieve these properties ZooKeeper uses TCP- therefore FIFO- channels for communication 2.3 Expected properties To guarantee that processes are consistent, there are a couple of safety properties to be satisfied by Zab. Additionally, for allowing multiple outstanding operations. we require primary order properties. To state these properties we first need some definitions In ZooKeepers crash-recovery model, if the primary process crashes, a new primary process needs to be elected. Since broadcast messages are totally ordered, we require at most one primary active at a time. So over time we get an unbounded sequence of primary processes P1p2... Pe.. where pe E Il. and e is an integer called epoch representing a period of time when Pe was the single primary in the ensemble. Process Pe precedes pe/, denoted pe pp/, if e then either pi delivers(v, 2l>or p, delivers (v, 2) Prinary order properties 12 are given below Local primary order: If a primary broadcasts(e, z) before it broadcasts(, 2> then a process that delivers(v, 2) must have delivered (v, 2) before(v, 2> Global primary order: Suppose a primary Pi broadcasts (v, 2), and a primary Pi>pa broadcasts(,, a). If a, process delivers both(u, 2) and,, 2), then must deliver (v, 2) before(, 2> Primary integrity: If a primary Pe broadcasts (v, 2) and some process delivers fU, a'> which was broadcast by Pel pe, then Pe must have delivered(v, a', before broadcasting〈U,). Local primary order corresponds to FIFO order. Primary integrity guarantees that a primary has delivered transactions from previous epochs 3 Atomic broadcast protocol In Zab, there are three possible (non-persistent )states a peer can assume: following, leading or election. Whether a peer is a, follower or a leader, it executes three zab hases:(1)discovery,(2) synchronization, and(3) broadcast, in this order. Previous to Phase 1, a peer is in state election, when it executes a leader election algorithm to look for a peer to vote for becoming the leader. At the beginning of Phase l, the peer inspects its vote and decides whether it should become a follower or a leader. For this reason. leader election is sometimes called phase 0 The leader peer coordinates the phases together with the followers, and there should be at most one leader peer in Phase 3 at a time, which is also the primary process to broadcast messages. In other words. the primary is always the leader. Phases l and 2 are important for bringing the ensemble to a mutually consistent state, specially when recovering from crashes. They constitute the recovery part of the protocol and are criti cal to guarantee order of transactions while allowing multiple outstanding transactions If crashes do not occur, peers stay indefinitely in Phase 3 participating in broadcasts similar to the two phase commit protocol 9. During Phases 1, 2, and 3, peers can decide to go back to leader election if any failure or timeout occurs ZooKeeper clients are applications that use ZooKeeper services by connecting to at least one server in the ensemble. The client submits operations to the connected server, and if this operation implies some state change, then the Zab layer will perform broadcast. If the operation was submitted to a follower, it is forwarded to the leader peer. If a leader receives the operation request, then it executes and propagates the state change to its followers. Read requests from the client are directly served by any ZooKeeper server. The client can choose to guarantee that the replica is up-to-date by issuing a sync request to the connected ZooKeeper server In Zab, transaction identifiers(zxid are crucial for implementing total order prop erties. The zxid x of a transaction u, a) is a pair (e. c), where e is the epoch number of the primary pe that generated the transaction( v, i), and c is an integer acting as counter. The notation 2.epoch means e, and 2. counter =c. The counter c is incre mented every time a new transaction is introduced by the primary When a new epoch starts-a new leader becomes active- c is set to zero and e is incremented from what was known to be the previous epoch. Since both e and c are increasing, transactions can be ordered by their zxid. For two zxids(e, c) and(/, c'>, we write(e, c)2el, d) if e< e or ife= el and c< c There are four variables that constitute th used during the recovery part of the protoco e persistent state of a peer, which are history: a log of transaction proposals accepted acceptedEpoch: the epoch number of the last NEWEPOCH message accepted currentEpoch: the epoch number of the last NEWLEADER message accepted lastZxid: zxid of the last proposal in the history We assume some mechanism to determine whether a transaction proposal in the history has been committed in the peer's ZooKeeper database. The e variable names above follow the terminology of 18, while in 12 they are different: history of a peer f is hf, acceptedEpoch is f-p, currentEpoch is fa, and lastZxid is f2rid 3.1 Phases of the protocol The four phases of the Zab protocol are described next Phase 0: Leader election Peers are initialized in this phase, having state election No specific leader election protocol needs to be employed, as long as the protocol terminates, with high probability, choosing a peer that is up and that a quorum of peers voted for. After termination of the leader election algorithm, a peer stores its vote to local volatile memory. If peer p voted for peer p, then p' is called the prospective leader for p. Only at the beginning of phase 3 does a prospective leader become an established leader, when it will also be the primary process. If the peer has voted for itself, it shifts to state leading, otherwise it changes to state following Phase 1: Discovery In this phase, followers communicate with their prospective leader, so that the leader gathers information about the most recent transactions that its followers accepted. The purpose of this phase is to discover the most updated sequence of accepted transactions among a quorum, and to establish a new epoch so that previous leaders cannot commit new proposals. The complete description of this phase is described in Algorithm1 1 Follower f. 2 Send the message FOLLOWERINFO(F. acceptedEpoch)to L 3 upon receiving NEWEPOCH(e from L do if e>> F.acceptedEpoch then F. acceptedEpoch e for all e received through FOLLOWERINFO(e) Propose NEWEPOCH(c to all followers in Q 17 upon receiving ACKEPOCH from all followers in do Find the follower f in Q such that for all f′∈Q\{八}: either f. currentEpoch Fhistory such that 2<2 z do hing(wait) end Commit(deliver)transaction e, 2) 28 end Algorithm 3: Zab phase 3 Broadcast algorithms L 2 and B] are apparently asynchronous and do not take into account possible peer crashes. To detect failures, Zab employs periodic heartbeat messages between followers and their leaders. If a leader does not receive heartbeats from a quorun of followers within a given timeout, it abandons its leadership and shifts to state election and Phase 0. a follower also goes to Leader Election Phase if it does not receive heart beats from its leader within a timeout 3.2 Analytical results We briefly mention some formal properties that Zab satisfies, and their correspond- ing proofs were given in Junqueira et al. 1112. The invariants are simple to show oy inspecting the three algorithms, while claims are carefully demonstrated using the invariants Invariant 1 22 In Broadcast Phase, a follower F accepts a proposal(e, ( u, z))only if F. currentEpoch =e Invariant 212 During the Broadcast Phase of epoch e, if a follower F has F. currentEpoch =e, then F accepts proposals and delivers transactions according to arid order Invariant 3 12 During Phase 1, a follower F will not accept proposals from the leader of any epoch e< F. acceptedEpoch Invariant 4[12 In Phase 1, an ACKEPOCH(F. currentEpoch, Fhistory, F.lastzxid) message does not alter, reorder, or lose transactions in Fhistory. In Phase 2, a NEWLEADER(e!, Lhistory )message does not alter, reorder, or lose transactions in Lhistory Invariant 5 12 The sequence of transactions a follower f delivered while in phase 3 of epoch F.currentEpoch is contained in the sequence of transactions broadcast by prinary PEe, where F e denotes the last epoch e such that f learned that e has been committed Claim 1 11y For every epoch number e, there is at most one process that calls ready(e in broadcast phase Claim 2 12 Zab satisfies the properties from Section 2. 3: broadcast integrity, agree ment, total order, local primary order, global primary order, and primary integrity Claim 3 12 Liveness property: Suppose thal a quorun Q of followers is up, the Followers in q have l as their prospective leader, L is up, and messages between a Follower in Q and are received in a linely fashion. IfL proposes a transaction (e, U, a)), then(e, U, 2 is eventually committed 4 Implementation Apache ZooKeeper is written in Java, and the version we have used for studying the implementation was 3.3.3 8. Version 3. 3.4 is the latest stable version(to this date) but this has very little differences in the Zab layer. Recent unstable versions have significant changes, though Most of the source code is dedicated to ZooKeepers storage functions and client communication. Classes responsible for Zab are deep inside the implementation. As mentioned in Section 2. 2 TCP connections are used to implement the bidirectional channels between peers in the ensemble. The FIfO order that TCP communication satisfies is crucial for the correctness of the broadcast protocol The Java implementation of Zab roughly follows algorithms四囡and图 Several optimizations were added to the source code, which make the actual implementation look significantly different from what we have seen in the previous section. In par ticular, the default leader election algorithm for Phase 0 is tightly coupled with the implementation of Phase 1 Fast Leader Election(FLE) is the name of the default leader election algorithm in the implementation. This algorithm employs an optimization: It attempts to elect as leader the peer that has the most up-to-date history from a quorum of processes. When such a leader is elected. in Phase 1 it will not need to communicate with followers to discover the latest history. Even though other leader election algorithms are supported by the implementation. in reality Phase 1 was modified to require that Phase 0 elects a leader with the most up-to-date history In practice, since FLE covers the discovery responsibility of Phase l, this phase has been neglected in version 3.3.3(and also 3.3.4) of ZooKeeper. There is no clear distinc tion between Phases I and 2 in the implementation, so we refer to the combination of both as recovery phase. This phase comes after Phase 0, and assumes that the leader has the latest history in a quorum. Algorithm 4 is an approximate pseudocode of the Recovery Phase, and Figure I] compares the implemented phases to Zab's phases Lab protocol Phasc 0 Phasc 2 (Leader Flection) (Discovery (Synchronization) (乃 broadcast) Implemented protocol Fast Leader election Recovery phase Broadcast phase Figure 1: Comparison betweens the phases of Zab protocol and the implemented pro tocol
(系统自动生成,下载前可以参看下载内容)

下载文件列表

相关说明

  • 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
  • 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度
  • 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
  • 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
  • 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
  • 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.
 输入关键字,在本站1000多万海量源码库中尽情搜索: