
Intuitive understanding: Zookeeper distributed consensus protocol ZAB

Author: Spicy circles

ZAB is the distributed consensus protocol used by Zookeeper; its full name is Zookeeper Atomic Broadcast, so ZAB is also called the Zookeeper atomic broadcast protocol. To address distributed consistency, Zookeeper does not use Paxos but the ZAB protocol. Based on ZAB, Zookeeper implements a primary/standby architecture to keep data consistent across the replicas in the cluster. The ZAB protocol includes two basic modes: message broadcast and crash recovery. Here is a closer look at how each of these modes works.

Message broadcast

Message broadcasting is the method Zookeeper uses to ensure the consistency of write transactions. A Zookeeper cluster contains nodes in the following three roles:

Leader: The core role of the Zookeeper cluster. It is elected by the Followers at cluster startup or during crash recovery, provides both read and write services to clients, and is the only node that processes transaction requests.

Follower: Also a core role of the Zookeeper cluster. It participates in elections at cluster startup or during crash recovery; a node that is not elected takes this role. A Follower serves read (non-transactional) requests for clients; it cannot process transaction requests itself and forwards any it receives to the Leader.

Observer: Does not participate in elections. It serves read (non-transactional) requests for clients and forwards any transaction requests it receives to the Leader. Observers exist to scale the cluster and improve read performance.

The following figures give a brief walkthrough of ZAB's message broadcast process.

  1. A Zookeeper node receives a request from a client. If it is a non-transactional (read) request, the node handles it locally. If it is a transaction request and the current node is a Follower, the request is forwarded to the Leader of the current cluster for processing.
  2. After receiving the transaction request, the Leader issues a proposal to all Follower nodes and waits for an Ack from each Follower. Before broadcasting a transaction, the Leader assigns it a globally unique, monotonically increasing ID, the transaction ID (zxid); every transaction must be processed in zxid order. The Leader also maintains a separate queue for each Follower and places the transactions to be broadcast into these queues.
  3. Each Follower acknowledges the Leader's proposal with an Ack. The Leader counts the Acks it receives; if more than half of the servers have acknowledged, it proceeds to the next step, otherwise it responds to the client that the transaction request failed.
  4. Once the Leader has received a majority of Acks, it issues a commit instruction to all Followers, executes the commit itself, and responds to the client that the transaction request succeeded.

  Zookeeper's message broadcast process is similar to 2PC (Two-Phase Commit), but ZAB only needs more than half of the Followers to return an Ack before committing, which greatly reduces synchronization blocking and improves availability.
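The propose/ack/commit flow above can be sketched in a few lines. This is a toy model, not ZooKeeper's real implementation; the `Leader` and `Follower` classes and their method names are hypothetical, and the point is only the majority-quorum commit rule:

```python
# Toy sketch of the ZAB broadcast flow: propose -> collect Acks ->
# commit once a majority of the ensemble (leader + followers) has acked.

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []        # proposals accepted but not yet committed
        self.committed = []  # committed transactions, in zxid order

    def on_proposal(self, zxid, txn):
        self.log.append((zxid, txn))
        return True          # Ack the proposal

    def on_commit(self, zxid):
        for entry in list(self.log):
            if entry[0] == zxid:
                self.log.remove(entry)
                self.committed.append(entry)

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.zxid = 0

    def broadcast(self, txn):
        self.zxid += 1                      # monotonically increasing zxid
        acks = 1                            # the leader counts itself
        for f in self.followers:
            if f.on_proposal(self.zxid, txn):
                acks += 1
        # Commit only when more than half of the ensemble has acked.
        if acks > (len(self.followers) + 1) // 2:
            for f in self.followers:
                f.on_commit(self.zxid)
            return True                     # success reported to the client
        return False

followers = [Follower(f"follower{i}") for i in range(4)]
leader = Leader(followers)
assert leader.broadcast("create /foo")
assert followers[0].committed == [(1, "create /foo")]
```

In the real protocol the per-Follower queues and Acks are asynchronous; the synchronous loop here is only to keep the quorum logic visible.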

Crash recovery

While a Zookeeper cluster is starting up or running, if the Leader crashes, loses network connectivity, stops, or restarts, or if a new server joins the cluster, ZAB quickly puts the cluster into crash recovery mode and elects a new Leader; during this period the cluster does not serve client requests. Once a new Leader is elected and more than half of the Followers have completed synchronization with it, ZAB switches the cluster from crash recovery mode back to message broadcast mode. The purpose of crash recovery is to elect a new Leader and complete state synchronization with the other Followers as quickly as possible, so the cluster can return to message broadcast mode and serve clients again.

The main task of crash recovery is Leader election, which happens in two scenarios: when the Zookeeper servers first start, and when the Leader crashes while the cluster is running. Before describing the election process in detail, a few parameters need to be introduced:

  • myid: The server ID, configured when Zookeeper is installed. The larger the myid, the higher the server's priority in a Leader election.
  • zxid: The transaction ID, generated by the Leader node when it makes a proposal. Because only the Leader can propose (Followers cannot generate zxids), the zxid is easily kept globally unique and monotonically increasing. A larger zxid means the node has more recently committed transactions, which is why zxid takes priority during crash recovery.
  • epoch: The election round. Each time a Leader election completes, the epoch is incremented. While no Leader exists, the epoch for that round stays unchanged.
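The epoch and zxid are in fact linked: in ZooKeeper the zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a per-epoch counter, so zxids compare correctly across leader changes. A small sketch (the helper names are hypothetical):

```python
# zxid layout: high 32 bits = epoch, low 32 bits = per-epoch counter.

def make_zxid(epoch, counter):
    return (epoch << 32) | counter

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

old = make_zxid(epoch=1, counter=500)   # committed under the old leader
new = make_zxid(epoch=2, counter=1)     # first transaction of the new epoch

# Any zxid from a later epoch outranks every zxid from an earlier one.
assert new > old
assert split_zxid(new) == (2, 1)
```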

In addition, during the election process each node transitions among the following states:

  • LOOKING: Election state; the node is looking for a Leader.
  • FOLLOWING: The node is a Follower; it synchronizes with the Leader and votes in Leader elections.
  • OBSERVING: The node is an Observer; it synchronizes with the Leader but does not vote in Leader elections.
  • LEADING: The node is the Leader.

Leader election when the cluster starts

  Suppose a cluster of five Zookeeper servers, Server1 through Server5, with myids 1 through 5, started one at a time in myid order. Since every node's zxid and epoch are 0 at startup, myid becomes the deciding factor in the Leader election.

  1. Server1 starts. It is the only running server in the cluster, cannot establish communication with any other node, and immediately enters the LOOKING state, voting for itself (assuming on startup that it can be the Leader). One vote is short of a majority (more than half of 5 servers, i.e. at least 3), so Server1 remains LOOKING.
  2. Server2 starts and establishes communication with Server1. The two exchange votes: Server1 votes for myid 1, Server2 for myid 2. The larger myid wins, so Server1 changes its vote to 2. Server2 now holds 2 votes, still short of a majority, so both Server1 and Server2 remain LOOKING.
  3. Server3 starts, enters the LOOKING state, and exchanges votes with the first two servers. Server1 and Server2 are voting for 2; Server3 votes for itself, myid 3; the largest myid wins, so all three now vote for Server3. With 3 votes, Server3 has a majority: it immediately becomes LEADING, and Server1 and Server2 become FOLLOWING.
  4. Server4 starts, enters the LOOKING state, and establishes communication with the first three servers. Because the cluster already has a node in the LEADING state, Server4 immediately switches to FOLLOWING; Server3 remains LEADING.
  5. Server5 starts and, like Server4, switches to FOLLOWING as soon as it establishes communication with the other servers; Server3 remains LEADING.

    In the end, Server3 becomes the Leader; Server1, Server2, Server4, and Server5 become Followers; and Server3's epoch is incremented by one.
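The startup walkthrough above can be replayed as a toy simulation: each server votes for the largest myid it has seen, and the election ends once some candidate holds more than half of the ensemble's votes. This is illustrative only, not ZooKeeper's actual FastLeaderElection algorithm:

```python
# Toy startup election: servers start in myid order, everyone votes
# for the largest myid seen so far, and a candidate wins once it has
# a majority (more than half of `total`, i.e. at least 3 of 5).

def startup_election(total=5):
    majority = total // 2 + 1
    votes = {}                           # myid -> current vote
    for myid in range(1, total + 1):     # servers start in myid order
        votes[myid] = myid               # each newcomer votes for itself
        best = max(votes.values())       # exchange votes: largest myid wins
        for server in votes:
            votes[server] = best
        tally = sum(1 for v in votes.values() if v == best)
        if tally >= majority:
            return best, myid            # (leader, step at which it won)
    return None, None

leader, elected_at = startup_election()
assert (leader, elected_at) == (3, 3)    # Server3 wins when it starts
```

The simulation stops at the moment a Leader emerges; in the real cluster Server4 and Server5 then join as Followers, as described above.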

Leader election when the Leader crashes

  When the Zookeeper cluster first starts, zxid and epoch play no role in the Leader election, since both are 0 everywhere. But if the cluster crashes after running for a while, epoch and zxid outrank myid in the election; the order of importance is epoch > zxid > myid. When a Follower loses communication with the Leader, it enters Leader election and communicates with the other nodes in the cluster, which leads to one of two situations:

  1. The Follower lost contact with the Leader, but the Leader has not actually crashed and remains in normal communication with the other Followers. When the disconnected Follower contacts the other Followers, they tell it the Leader is still alive, and the Follower simply needs to re-establish its connection to the Leader.
  2. The Leader really has crashed. All nodes in the cluster communicate with one another; once they learn the Leader is gone, each node enters election mode and sends out its current epoch, zxid, and myid as its vote. Votes are compared in the order epoch > zxid > myid, and the candidate whose vote count exceeds half the number of nodes in the cluster becomes the new Leader.

  This post-crash election mechanism is easy to understand: if the Leader dies, the node from the most recent leadership round (largest epoch) is preferred as the new Leader; next comes the node with the most recently committed transaction (largest zxid); and finally, ties are broken by the largest server number (myid).
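The epoch > zxid > myid priority maps naturally onto lexicographic tuple comparison. A minimal sketch, with made-up vote values for illustration:

```python
# Post-crash vote comparison: (epoch, zxid, myid) tuples compared
# element by element, which is exactly the epoch > zxid > myid priority.

def better_vote(a, b):
    """Return the stronger of two votes, each (epoch, zxid, myid)."""
    return max(a, b)  # Python tuples compare lexicographically

v1 = (5, 1000, 1)   # older epoch, even though its zxid is larger
v2 = (6, 900, 2)    # newer epoch wins outright
v3 = (6, 900, 3)    # same epoch and zxid, larger myid breaks the tie

assert better_vote(v1, v2) == v2   # epoch dominates zxid
assert better_vote(v2, v3) == v3   # myid is the final tiebreaker
```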