The distributed consensus problem deals with reaching agreement on a single data value among a group of processes connected by an unreliable network. Each process puts forward its candidate value, the processes communicate with one another, and they must agree on a single consensus value even when some of them fail.

Examples of applications of consensus include:

  • Leader election / Mutual exclusion
  • Commit or Abort in distributed transactions
  • Reaching agreement about which process has failed
  • Clock phase synchronization
  • Load balancing

The problem can be pictured as follows. Each process pk with an input value uk runs a program to exchange its value with the others; finally, the outputs of all non-faulty processes become identical. One or more processes may fail at any time, but the decided output v must be equal to the input value of at least one process.

A global view of a distributed system makes clear what role consensus plays within it.

Protocols that solve consensus problems are designed to deal with limited numbers of faulty processes. A consensus protocol tolerating halting failures must satisfy the following properties.

  • Termination. Every non-faulty process must eventually decide.
  • Agreement. The final decision of every non-faulty process must be identical.
  • Validity. If every non-faulty process begins with the same initial value v, then their final decision must be v.

The Byzantine Generals’ Problem

The Byzantine Generals’ Problem, proposed by Leslie Lamport, is a fault-tolerance problem in distributed systems, and it represents the most complex and rigorous fault-tolerance model.

In this model, the system imposes no restrictions on the behavior of the nodes in the cluster: they can send arbitrary or even erroneous data to other nodes, or choose not to respond to requests from other nodes at all. These unpredictable behaviors make fault tolerance much more complicated.

This problem describes a scenario in which each of a group of generals commands part of an army. No general knows whether the other generals are reliable, yet together they need to choose between attack and retreat.

It becomes a serious problem when the generals end up with different decisions: because of a traitorous general or a tampered message, some generals choose to attack while the rest retreat, and each of them believes that its own choice is the one approved by the majority.

The Byzantine Generals’ Problem is the strictest fault-tolerance requirement for a distributed system, and it is not a problem that most distributed systems used in daily work need to solve. The failures we most frequently encounter are nodes going down or failing to respond, which are much simpler to handle.

FLP

The FLP impossibility result, proved by Fischer, Lynch, and Paterson, is one of the most significant theorems in the distributed-systems field. It states:

In a purely asynchronous distributed system, no deterministic protocol can guarantee that consensus is reached if even a single process may crash.

Consensus Protocols

2PC

In a distributed system, every node knows its own state after performing an operation, but it cannot know the state of the other nodes. Two-Phase Commit (commonly known as 2PC) introduces a coordinator that uniformly controls all nodes and tells them whether or not to commit the result of the operation.

The execution of 2PC is divided into two phases, the voting phase and the commit phase.

Voting Phase

In the voting phase, the coordinator asks the cohorts whether the operation can be executed. Each cohort performs the corresponding operation and records redo and rollback logs, and then every participant sends a COMMIT or ABORT vote to the coordinator indicating the result of the operation.

Commit Phase

If all cohorts vote yes, the coordinator sends a COMMIT request to every participant; each participant finishes the operation, releases its resources, and returns a completion message to the coordinator. The transaction ends once the coordinator has received the completion messages.

In contrast, if any participant votes to ABORT the current transaction, the coordinator sends a ROLLBACK request to all participants, telling them to roll back the operation based on the previously written logs.
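
To make the flow concrete, here is a minimal Go sketch of a blocking 2PC coordinator; the Cohort interface, the node type, and the twoPhaseCommit function are hypothetical names introduced only for illustration, not an actual 2PC implementation.

package main

import "fmt"

// Cohort is a hypothetical participant interface: Vote corresponds to the
// voting phase, Commit and Rollback to the commit phase.
type Cohort interface {
    Vote(op string) bool // true = COMMIT vote, false = ABORT vote
    Commit(op string)
    Rollback(op string)
}

type node struct {
    name string
    ok   bool
}

func (n node) Vote(op string) bool { fmt.Println(n.name, "votes on", op); return n.ok }
func (n node) Commit(op string)    { fmt.Println(n.name, "commits", op) }
func (n node) Rollback(op string)  { fmt.Println(n.name, "rolls back", op) }

// twoPhaseCommit is a minimal, blocking coordinator: the voting phase collects
// votes, and the commit phase broadcasts COMMIT only if every cohort voted yes.
func twoPhaseCommit(op string, cohorts []Cohort) bool {
    for _, c := range cohorts { // voting phase
        if !c.Vote(op) {
            for _, c2 := range cohorts { // any ABORT vote aborts the transaction
                c2.Rollback(op)
            }
            return false
        }
    }
    for _, c := range cohorts { // commit phase
        c.Commit(op)
    }
    return true
}

func main() {
    cohorts := []Cohort{node{"A", true}, node{"B", true}, node{"C", false}}
    fmt.Println("committed:", twoPhaseCommit("tx-1", cohorts))
}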

One obvious drawback of 2PC is that it is a blocking protocol: if the coordinator goes down permanently or a network failure occurs, some participants of the transaction will never be able to complete it, because no COMMIT or ROLLBACK message ever arrives.

3PC

The Three-Phase Commit protocol addresses the above problem by:

  1. introducing timeouts, and
  2. adding a pre-commit phase between the voting and commit phases, as sketched below. If a cohort receives neither COMMIT nor ROLLBACK before its timeout expires, it can still release its resources instead of blocking forever.
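
As a rough sketch of the timeout idea only (not the full three-phase state machine), the Go snippet below shows a cohort that stops waiting for the final decision once its timeout expires; the decision type and awaitDecision function are assumed names for illustration.

package main

import (
    "fmt"
    "time"
)

// decision messages a cohort may receive after acknowledging the pre-commit.
type decision int

const (
    commit decision = iota
    rollback
)

// awaitDecision models a 3PC cohort waiting for the final decision: if neither
// COMMIT nor ROLLBACK arrives before the timeout, it does not block forever but
// applies a default action and releases its resources.
func awaitDecision(from <-chan decision, timeout time.Duration) {
    select {
    case d := <-from:
        if d == commit {
            fmt.Println("cohort: commit")
        } else {
            fmt.Println("cohort: rollback")
        }
    case <-time.After(timeout):
        // The coordinator is unreachable: time out instead of blocking,
        // apply the default action for the current state, release resources.
        fmt.Println("cohort: timed out, applying default action and releasing resources")
    }
}

func main() {
    ch := make(chan decision)
    go awaitDecision(ch, 100*time.Millisecond)
    time.Sleep(200 * time.Millisecond) // simulate a coordinator that never answers
}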

The main disadvantage of this protocol is that it cannot recover if the network becomes segmented in any way. The original 3PC assumes a fail-stop model, in which processes fail only by crashing and crashes can be accurately detected, and it does not work with network partitions or asynchronous communication.

Paxos

Paxos is a family of protocols for solving consensus in a network of unreliable processors.

Roles

Paxos describes the actions of the processors by their roles in the protocol: acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time without affecting the correctness of the protocol.

  • Proposer. A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
  • Acceptor (Voters). The Acceptors act as the fault-tolerant “memory” of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
  • Learner. Learners act as the replication factor for the protocol. Once a client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
  • Leader. Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.

Basic Paxos

This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds, each consisting of two phases, until a round succeeds. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors.

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,Vn)
   |         |<---------X--X--X------>|->|  Accepted(1,Vn)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Phase 1a: Prepare

A Proposer (Leader) creates a proposal identified with a number N. This number must be greater than any previous proposal number used by this Proposer. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors. The Proposer decides who is in the Quorum.

Phase 1b: Promise

If the proposal’s number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N.

If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer.

Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (Nack) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.
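
A minimal sketch of this Promise logic on an Acceptor might look as follows in Go; the Promise and Acceptor types and their fields are illustrative assumptions, not taken from any particular Paxos library.

package main

import "fmt"

// Promise is an Acceptor's reply to Prepare(n): OK reports whether the promise
// was made; AcceptedN and AcceptedV carry any previously accepted proposal
// (zero values mean "none yet").
type Promise struct {
    OK        bool
    AcceptedN int
    AcceptedV string
}

// Acceptor remembers the highest proposal number it has promised and the last
// proposal it has accepted.
type Acceptor struct {
    promisedN int
    acceptedN int
    acceptedV string
}

// HandlePrepare implements Phase 1b: if n is higher than any proposal number
// seen so far, promise to ignore all future proposals numbered below n and
// report any previously accepted proposal; otherwise reply with a Nack.
func (a *Acceptor) HandlePrepare(n int) Promise {
    if n > a.promisedN {
        a.promisedN = n
        return Promise{OK: true, AcceptedN: a.acceptedN, AcceptedV: a.acceptedV}
    }
    return Promise{OK: false} // Nack: already promised a higher-numbered proposal
}

func main() {
    a := &Acceptor{}
    fmt.Println(a.HandlePrepare(1)) // promised
    fmt.Println(a.HandlePrepare(1)) // Nack: not strictly higher, ignored
}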

Phase 2a: Accept Request

If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value to its proposal.

If any Acceptors had previously accepted any proposal, then they’ll have sent their values to the Proposer, who now must set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors.

If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose the value it originally wanted to propose.

The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.
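
The value-selection rule of Phase 2a can be sketched like this in Go; the Promise struct and chooseValue function are illustrative assumptions mirroring the sketch above.

package main

import "fmt"

// Promise as returned by Acceptors in Phase 1b; AcceptedN == 0 means the
// Acceptor has not accepted any proposal yet.
type Promise struct {
    AcceptedN int
    AcceptedV string
}

// chooseValue applies the Phase 2a rule: if any Acceptor in the Quorum reported
// a previously accepted proposal, the Proposer must adopt the value of the
// highest-numbered one; otherwise it is free to propose its own value.
func chooseValue(promises []Promise, ownValue string) string {
    highestN := 0
    value := ownValue
    for _, p := range promises {
        if p.AcceptedN > highestN {
            highestN = p.AcceptedN
            value = p.AcceptedV
        }
    }
    return value
}

func main() {
    quorum := []Promise{{0, ""}, {3, "Vb"}, {2, "Va"}}
    fmt.Println(chooseValue(quorum, "Vclient")) // "Vb": the highest accepted proposal wins
}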

Phase 2b: Accepted

If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to only consider proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Else, it can ignore the Accept Request.

Note that an Acceptor can accept multiple proposals. These proposals may even have different values in the presence of certain failures. However, the Paxos protocol will guarantee that the Acceptors will ultimately agree on a single value.
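
Continuing the illustrative sketch, the Phase 2b check on an Acceptor might look as follows; again, the types and names are assumptions, not a definitive implementation.

package main

import "fmt"

// Acceptor state, as in the Phase 1b sketch above (illustrative only).
type Acceptor struct {
    promisedN int
    acceptedN int
    acceptedV string
}

// HandleAccept implements Phase 2b: accept the proposal unless a promise has
// already been made to only consider proposals with a higher number; on
// success the Acceptor records the value and would send Accepted(n, v) to the
// Proposer and every Learner.
func (a *Acceptor) HandleAccept(n int, v string) bool {
    if n >= a.promisedN {
        a.promisedN = n
        a.acceptedN = n
        a.acceptedV = v
        return true // would broadcast Accepted(n, v)
    }
    return false // ignore: promised to a higher-numbered proposal
}

func main() {
    a := &Acceptor{promisedN: 2}
    fmt.Println(a.HandleAccept(1, "Va")) // false: already promised N = 2
    fmt.Println(a.HandleAccept(2, "Vb")) // true
}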

Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number. Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.

Examples of failure

The failure of an Acceptor (with a Quorum size of two Acceptors) is shown in the following diagram.

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |          |  |  !       |  |  !! FAIL !!
   |         |<---------X--X          |  |  Promise(1,{Va, Vb, null})
   |         X--------->|->|          |  |  Accept!(1,V)
   |         |<---------X--X--------->|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |          |  |

The next diagram shows the failure of a redundant Learner.

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{null,null,null})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |         |          |  |  |       |  !  !! FAIL !!
   |<---------------------------------X     Response
   |         |          |  |  |       |

The next failure case is when a Proposer fails after proposing a value, but before agreement is reached.

Client  Proposer        Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1,{null, null, null})
   |      |             |  |  |       |  |
   |      |             |  |  |       |  |  !! Leader fails during broadcast !!
   |      X------------>|  |  |       |  |  Accept!(1,V)
   |      !             |  |  |       |  |
   |         |          |  |  |       |  |  !! NEW LEADER !!
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2,{V, null, null})
   |         X--------->|->|->|       |  |  Accept!(2,V)
   |         |<---------X--X--X------>|->|  Accepted(2,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Raft

Raft is a consensus algorithm designed as an alternative to Paxos that is easier to understand, achieved by separating the logic into sub-problems such as leader election and log replication.

Raft achieves consensus via an elected leader. A server in a Raft cluster is either a leader or a follower, and can become a candidate in the specific case of an election (when the leader is unavailable).

The leader regularly informs the followers of its existence by sending a heartbeat message. Each follower has a timeout in which it expects the heartbeat from the leader. The timeout is reset on receiving the heartbeat.

Leader Election

If no heartbeat is received, the follower changes its status to candidate and starts a leader election.

In this case, a new term starts in the cluster. A term is an arbitrary period of time on the server during which a new leader needs to be elected. Each term starts with a leader election. If the election is completed successfully (i.e. a single leader is elected) the term keeps going with normal operations orchestrated by the new leader. If the election is a failure, a new term starts, with a new election.

A leader election is started by a candidate server. A server becomes a candidate if it receives no communication from the leader over a period called the election timeout, so it assumes there is no acting leader anymore. It starts the election by increasing the term counter, voting for itself as the new leader, and sending a message to all other servers requesting their vote.

A server will vote only once per term, on a first-come-first-served basis. If a candidate receives a message from another server with a term number at least as large as the candidate’s current term, then the candidate’s election is defeated and the candidate changes into a follower and recognizes the leader as legitimate. If a candidate receives a majority of votes, then it becomes the new leader. If neither happens, e.g., because of a split vote, then a new term starts, and a new election begins.
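
A rough Go sketch of this election logic is shown below; the server and voteReply types are simplified assumptions and omit Raft's log-comparison rules for granting votes.

package main

import "fmt"

type role int

const (
    follower role = iota
    candidate
    leader
)

// voteReply is a simplified RequestVote response from a peer.
type voteReply struct {
    term        int
    voteGranted bool
}

type server struct {
    id          int
    role        role
    currentTerm int
    votedFor    int // -1 means no vote cast in the current term
    clusterSize int
}

// runElection sketches the candidate logic: increment the term, vote for
// itself, and tally the replies. A majority of votes makes it leader; a reply
// carrying a newer term turns it back into a follower; otherwise (e.g. a split
// vote) it stays a candidate and retries after a fresh randomized timeout.
func (s *server) runElection(replies []voteReply) {
    s.role = candidate
    s.currentTerm++
    s.votedFor = s.id
    votes := 1 // its own vote
    for _, r := range replies {
        if r.term > s.currentTerm {
            s.role = follower // a more up-to-date server exists; step down
            return
        }
        if r.voteGranted {
            votes++
        }
    }
    if votes > s.clusterSize/2 {
        s.role = leader
    }
}

func main() {
    s := &server{id: 1, role: follower, votedFor: -1, clusterSize: 5}
    s.runElection([]voteReply{{term: 1, voteGranted: true}, {term: 1, voteGranted: true}})
    fmt.Println("term:", s.currentTerm, "became leader:", s.role == leader)
}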

Log Replication

The leader is responsible for the log replication and also accepts client requests. Each client request consists of a command to be executed by the replicated state machines in the cluster.

After being appended to the leader’s log as a new entry, each of the requests is forwarded to the followers as AppendEntries messages. In case of unavailability of the followers, the leader retries AppendEntries messages indefinitely, until the log entry is eventually stored by all of the followers.

Once the leader receives confirmation from the majority of its followers that the entry has been replicated, the leader applies the entry to its local state machine, and the request is considered committed.
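
A simplified Go sketch of this commit rule follows; the leaderState type and replicate function are illustrative assumptions and ignore retries, term checks, and follower log consistency.

package main

import "fmt"

// entry is a single command in the replicated log.
type entry struct {
    term    int
    command string
}

// leaderState is the part of the leader's state needed for the commit rule.
type leaderState struct {
    log         []entry
    commitIndex int // highest log index known to be committed
    clusterSize int
}

// replicate appends a client command to the leader's log and applies the
// commit rule: once a majority of servers (the leader itself plus follower
// acknowledgements of AppendEntries) has stored the entry, it is committed
// and can be applied to the state machine.
func (l *leaderState) replicate(term int, command string, followerAcks int) bool {
    l.log = append(l.log, entry{term: term, command: command})
    index := len(l.log) // 1-based index of the new entry
    if 1+followerAcks > l.clusterSize/2 {
        l.commitIndex = index
        return true // committed: apply locally and reply to the client
    }
    return false // not yet committed: keep retrying AppendEntries
}

func main() {
    l := &leaderState{clusterSize: 5}
    fmt.Println("committed:", l.replicate(3, "set x=1", 2)) // 3 of 5 servers -> committed
    fmt.Println("committed:", l.replicate(3, "set y=2", 1)) // 2 of 5 servers -> not yet
}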

Reference