I was put onto an excellent article by Coda Hale titled ‘You Can’t Sacrifice Partition Tolerance‘ which expands on some of the ideas laid out in Dr Eric Brewer’s CAP Theorem.
The CAP Theorem is the distributed systems version of the classic engineering maxim of “Good, fast, cheap: pick two”. Briefly stated, it declares that any distributed system may only have two of the following properties: consistency; availability; partition tolerance. It makes perfect sense when you think about it, as it’s logically impossible to remain both consistent and available in the case of the system becoming partitioned due to node failure/disconnections etc.
The case raised by Hale is that, for any distributed system, partition tolerance is not something that can simply be chosen. That is, disconnections and node failures etc are guaranteed to happen at some point in time. They are simply an annoyingly constant fact of life for anyone involved in the field. Therefore, any real-world distributed system must support some kind of fault-tolerance, otherwise it’s useless.
This means that the distributed systems designer is left to decide the following: given that system failures will occur, and your system may become partitioned, is your system going to degrade by sacrificing consistency, or availability? It depends on the system requirements as to which should be chosen, however it’s the unfortunate face that you simply can’t preserve both.
Given this, Hale suggests that a better way to think about handling failure is by using yield and harvest. Yield is defined as the impact of unavailability on handling incoming requests (related to but different to uptime), and harvest is a measure of how much of the required data is available to fulfil a request (ie. if a request needs to data from two servers, one of which is unavailable, the harvest will be only 50%).
Hale’s point is that you can choose which of these will be affected in the case of system failures and partitions. Is it better for some requests to be dropped in the interests of preserving accuracy (harvest), or can we handle a lower harvest in the interests of maintaining responsiveness? Again, the system requirements determine the answer.
I find this a very useful way to think about how a system is going to degrade when faults occur. After all, if you design something to be scalable (increase throughput as nodes are added) it makes perfect sense to consider how a system will degrade/scale down as nodes are lost. It can only be beneficial to be completely clear on the choices and tradeoffs involved, and that, ultimately, you can’t have your cake and eat it too when it comes to consistency and availability.