2019 started with a bang at work. My school organized a research week with a series of distinguished research talks delivered by world-class researchers. I hardly came away from any talk without learning anything new or interesting. The first day covered distributed systems, in glorious theoretical depth and practical complexity. Having only recently gotten less scared of this field, I left Valerie King’s presentation with wholly new understanding of the FLP theorem. Which is to say that my previous, though unstated, view of the theorem was wrong. Well, not as much wrong as incomplete. And that prompted me re-reading some papers, then spent several days wrestling with the original FLP paper. This blog is intended to capture what I understood so far. Knowing how subtle distributed systems are, it will not be the last on FLP and consensus protocols.
FLP and its implication
At a high level, FLP theorem says that consensus in an asynchronous network is impossible in the presence of node failure. This holds even for a single crash failure. We are reminded that crash failure is the most simple form of failure in distributed systems. The FLP result, as my initial understanding goes, has this implication:
In practical terms, we cannot simultaneously have safety (all non-faulty nodes agrees/decides on the same value) and liveness (the protocol terminates with a decision). Therefore, every consensus protocol must either sacrifice one property, or relax assumption about network asynchrony.
Most state-of-the-art consensus protocols follow the second strategy. PBFT, for example, assumes partially synchronous networks in which there exists an unknown bound on message delivery. PBFT achieves safety even in fully asynchronous network condition, and liveness under period of synchrony, as permitted by FLP.
So far, this view of FLP is not wrong. But it is not the whole truth. What I’ve been missing is that
FLP concerns only deterministic consensus protocols.
(This fact is stated so clearly in the original paper that I have only myself to blame for my initial, shallow understanding of one of the most important results in computer science.)
Now before I go and repent for this gross oversight, it is worth mentioning that I did sense something amiss when reading Andrew Miller’s HoneyBadger protocol (HBFT) which claims to work in an asynchronous systems. Having failed to understand all its details, I drew a conclusion that HBFT circumvented FLP by sacrificing safety: it achieves agreement only probabilistically. Well, now that was wrong!
A deterministic consensus protocol, as properly defined in FLP, means that nodes in the protocol behave deterministically. Determinism is a property of of node behavior, not of the final outcome. More specially, a node output (its decision) is entirely determined by the sequence of input events the node receives. All protocols that I had so far known of are deterministic. PBFT, Raft and Paxos, for examples, contains no randomness. When they receives a message, there path of execution is completely deterministic. Of course, different orders of inputs lead to different paths; but again that mapping of sequence of input to execution (and subsequently to output) is completely deterministic.
So what does a non-deterministic consensus protocol looks like. Well, opposite of a deterministic thing is a randomized (or probabilistic) thing. A randomized algorithm’s output depends not only on its external input, but also on a random coin which is flipped during its execution. So, given two exactly the same inputs, the algorithm may output two different values.
FLP theorem does not apply to randomized protocol. Viewed this way, it is a positive result, giving researchers another way to engineer practical consensus protocols. There are only a small numbers of such algorithms, the most famous one is by Ben-Or in 1983, and they all share the same structure as follows:
Each node has a source of randomness. Some algorithms allow the node to have its private, local coin; other optimize for cases when there exists a global coin. All algorithms proceed in many rounds of voting.
Each node broadcasts its value to the network, and waits for messages to come back. This constitutes a round of execution, after which the node moves on to another round. Note that it cannot wait for more than messages, due to network asynchrony.
There are some thresholds defined in and . If the node receives more than messages with the same value, it can decide immediately.
There are some other threshold , below which the node tosses the coin and vote for that random value in the next round.
The probability that a node will decide at round is of this form
which means as . State-of-the arts randomized protocols compete with each other by lowering the expected number of rounds by which the nodes decide. Many trade fault tolerance (the ratio of against ) for constant and low expected number of round.
Why would it work? One direct question to ask is why, intuitively, non-determinism helps circumventing FLP. The core idea here is that in asynchronous networks, we assume the worst, that is the adversary has total control over message delivery with the only constraint that it will eventually deliver messages. In deterministic settings, sequence of messages is extremely important. In fact, the adversary in FLP craftily delay messages so that the nodes can never decide. In contrasts, the effect of message delivery (and order) is less important in non-deterministic setting, because the node behavior is also dependent on some randomness that the adversary cannot control. In other words, the adversary exerts less power in this setting, and therefore cannot always prevent the algorithm from reaching agreement.
Now let us formally describe FLP. As many classic distributed systems papers, FLP is deceptively short and sweet. But only by re-reading it so many times, together with working backwards, that I am able to feel how the proof works.
Each node is fully specified by its state and a transition function. The state consists of the input value, the output value (starting with a default value ), and any temporary variables during execution. The transition function is the meat. It is invoked by a receiving a message from the network:
- Given the message and current state as input, it deterministically moves the node to a new state.
- It may send many messages to the network during execution of the transition function.
Nodes communicate through the network which is modeled as a centralized buffer. Controlled by the adversary, it consists of messages exchanged between node. A message is of the form where is the message destination, is the content.
An asynchronous network is characterized by the fact that the adversary controls when a message is delivered. In particular, the adversary can reorder messages in the buffer and insert arbitrary number of empty (or null) messages to the buffer. For instance, instead of delivering a message to , the adversary can simply deliver . In either case, ’s transition function is invoked. The real message can be delayed for arbitrarily long.
However, delayed messages is not the same as lost messages. The asynchronous network does provide a form of reliability: over infinite period of time, will eventually be delivered. More specifically, if infinitely asks the buffer for a message, and there exists a in the buffer, will receive it. The problem here is the word infinity, which in this case means there is no bound on the delay. In contrast, in partially synchronous settings, there is an unknown, yet finite period after which is delivered.
A configuration C comprises states of all the nodes in the system, plus the message buffer. At any given time, represent the system’s global states. At a configuration , any node may take a step by receiving a message from the network, then execute its transition function, thereby changing its state and consequently move the system to a new configuration . We say that is the result of applying to , written as .
An initial configuration is a configuration where the buffer is empty and every node is in its initial, fresh state. Every consensus protocol starts from an initial configuration.
A schedule, or run consists of a sequence of steps . Such a sequence can be of infinite length, since there can be an infinite number of in the buffer. A schedule can be applied to a configuration (which is not necessarily an initial configuration), by iteratively applying events in to . For example, , then .
A configuration is accessible for a protocol , if there exists an initial configuration and a schedule such that .
####Concerning safety A configuration has a decision value if there exists such that output value is . Another way to say it is that decides on . Note that may have more than one decision values. But we are only concerned with valid (or safe) protocol such that:
Every accessible configurations has only 1 decision value. And for all there exists an accessible configuration with that value.
This definition of validity is in fact weaker than what we commonly require for consensus. Here, it only needs some nodes to reach a decision, whereas ordinary consensus is valid only if all non-faulty nodes reach a decision. Nevertheless, if it can be proven that this weaker version is impossible, there is no hope for the stronger version.
A deciding run is a schedule whose final configuration has a decision value. Recall again that we are only dealing with valid deciding runs, in which the configuration has only one decision value.
A configuration is bivalent if there exists two deciding runs that lead to configurations with decision values and respectively. is -valent (or -valent) if all reachable configurations by deciding runs have decision value of (or respectively). In layman term, a bivalent configuration is one in which the protocol can go both way, i.e. there exists an important event that flips the decision one way or another. Once such event occurs, the protocol enters an univalent state which it can never goes back. Consider .
If is bivalent, can be either bivalent or univalent.
If is -valent, is also -valent.
A non-faulty node is one that all messages sent to it will be delivered, whereas a faulty node stops receiving messages at some point. What interesting is that we can define faulty behavior through the message buffer:
- If a node is not faulty, it infinitely asks for messages and the buffer will eventually deliver.
- Otherwise, the message buffer stops delivering messages to the faulty node.
An admissible run is a schedule in which there is at most one faulty node. Let be a schedule. If is not faulty, all messages entered the buffer will appear in the sequence. If is faulty, there is a point in after which no event is delivered.
Putting it together
A correct consensus protocol under one faulty node is one in which every admissible run is a deciding run. And no such protocol exists.
Here we have the familiar definition of consensus protocol: validity (deciding run) and termination (admissible run).
The proof of FLP is quite layered. But each layer is short and sweet. Together they remind me of an old saying that in mathematics and science, short theorems accompanied by short proofs are often the most beautiful and the most impactful.
The proof is based on contradiction. By assuming that a correct consensus protocol exists, we are going to build up a number of lemmas which are put together to derive contradiction. I am going to highlight these lemmas’ core ideas and some subtlety that I struggled with.
Lemma 1 (diamond property): from a starting configuration , if two schedule does not have any node in common (if , there is no message delivered to , then . In other words, two disjoint schedules can be applied commutatively to a configuration. This is true because one message affects only the node that receives it, and because the configuration is merely an aggregation of all nodes’ states.
Lemma 2 (starting at a bivalent configuration): if exists, then there exists a bivalent initial configuration. The implication here is that one can start from a point at which the decision is not yet determined, meaning that the final decision is dependent not only on the starting inputs, but also the behavior of the message buffer. The proof goes like this: assume the contrary, that is all initial configurations of are univalent.
One requirement for correctness is that there exists two different, accessible configurations that decides on and on respectively. And because -valent configuration can only lead to configurations that decide , there must exists two initial configuration and that are - and -valent respectively.
Consider an initial configuration which is -valent. By changing the starting value of one node, we get an adjacent configuration which is also an initial configuration. Let assume again that is -valent. By applying a chain of such transformation (changing starting value of the nodes, one at a time) we can any other initial configurations from . One of such reachable configurations must be -valent. Therefore, there exists two initial configurations and differing only in the starting value of one node , and they are - and -valent respectively.
Consider and an admissible run in which is faulty. This is allowed because can tolerate one fault, which may as well be . And being correct means is a deciding run. It must therefore decide on . Now apply to . By Lemma 1 this should lead to the same configuration as with , which decides on . But this is not possible, since is -valent.
So it means there exists a bivalent initial configuration.
Lemma 3 (maintaining bivalent configuration): if exists, given a bivalent configuration , let be an event applicable to . We delay , then apply it to every reachable configurations from to get a set of configurations . is the result of applying last, and it contains a bivalent configuration. The implication here is that the adversary, being able to control the order of message delivery, can guide the system from one bivalent configuration to another configuration. The proof is again by contradiction: assume that does not contain bivalent configurations.
- contains both - and -valent configuration. The reasoning for this is quite straightforward in the paper so I’ll ignore it here.
There exist two neighboring configurations reachable from such that applying to them results in a - and -valent configurations in . More specifically, there exists reachable from and such that or , is -valent and is -valent. The paper states that this fact can be derived via “easy” induction, which I find anything but.
First, and are not necessarily - and -valent. Recall that an -valent configuration can be reached directly in two ways: from an -valent or a bivalent configuration.
Second, let us find such two neighbors and , keeping in mind that any bivalent configuration reachable from must become univalent after applying (due to the contradiction assumption). Recall that is bivalent, hence there is a path to another bivalent configuration after which there can only be univalent configurations (see the Figure above). Extending and with will lead to a - and -valent configurations in . Now let us consider . Because , it must be univalent (remember the contradiction assumption). If is -valent, then we can assign and . Otherwise, and .
Without loss of generality, let and . There are two further cases: Figure 2 and 3 from the original paper
- If (Figure 2), then thanks to the diamond lemma. But this means leads to both and , violating the assumption of non-bivalency in .
If (Figure 3), then there exists a deciding run in which is faulty. I spent two days trying to show why exists. Under the assumption that exists, we only need to show that is a part of an admissible run by (remember, correctness of means every admissible run is a deciding run). Let’s say for some initial state . The question becomes whether there exists in which makes no step, and is an admissible run.
Yes. The adversary can construct based on as follows. It stops delivering all messages destined for , and makes sure all messages for are delivered. The resulting schedule meet the definition for an admissible run. My original struggle with this lemma was because I thought if contains one faulty node different to , then cannot be constructed. However, the liberating insight here is that even if appears as faulty in , we can make it non-faulty again in by delivering all of its messages. The only important trace is , which is admissible because all except got all of their messages.
Let us continue. Now that exists, applying the diamond lemma again lead us to contradiction. If decides on , it can be moved to a state that decides on , meaning that two nodes decide two different values. Hence, ’s validity is violated. The same argument applies if decides on .
The final proof
With Lemma 2 and 3, we are going to construct an admissible run that does not decide, which contradicts the assumption that exists. The algorithm goes like this:
- Start with a bivalent initial configuration , which exists by Lemma 2.
- Choose a message for node which is at the head of a node queue (the queue starts out in random order). Find a run from that ends with at a bivalence configuration . Such a run exists because of Lemma 3.
- Set , put to the back of , and repeat step 2.
This algorithm has an admissible run because after infinite steps all messages are delivered. But it never decides on any value, contradicting the assumption that every admissible run is a deciding run (or terminates). QED.