Wearing Off the Elbow Patches

Alistair Atkinson's Blog

Category: Distributed Systems

A Distributed Tuple Space for Scalable Scientific Computing

It turns out I’m not above a small spot of vanity publishing; in this case, the thesis I completed a while back.


It’s a nice little perk for completed PhDs: the opportunity to have your thesis published as a monograph through academic publishers such as LAP. It ultimately means very little, as it’s simply a re-packaging of your thesis, but it’s nice to have a professionally published version to keep.

So, if you do search for me on Amazon, you’ll now get a result for my (rather exorbitantly priced) repackaged thesis! Now I just need to sit back and wait for the royalty cheques to roll in…

Caching by Mail

A nice little bit of nostalgia from Stuart Cheshire’s 1996 article ‘It’s the Latency, Stupid’ in its discussion of caching:

One of the most effective techniques throughout all areas of computer science is caching, and that is just as true in networking…

Recently companies have started providing CDROMs of entire Web sites to speed Web browsing. When browsing these Web sites, all the Web browser has to do is check the modification date of each file it accesses to make sure that the copy on the CDROM is up to date. It only has to download files that have changed since the CDROM was made. Since most of the large files on a Web site are images, and since images on a Web site change far less frequently than the HTML text files, in most cases very little data has to be transferred.

I have no recollection of ever encountering this, seeing as I was all of fourteen at the time it was written; maybe that’s why I find it amusing. It’s certainly a great example of Sneakernet in action, and it also illustrates just how much the web has changed since, and how unbelievably static it was in its original form.

As they say: never underestimate the bandwidth of a 747 full of tapes. Or, in this case, a postie bike.
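
For fun, here’s what that freshness check might look like today: a minimal sketch in Python, assuming a hypothetical local path standing in for the CD-ROM copy, and using HTTP’s If-Modified-Since header to fetch only the files that have changed since the disc was cut.

    import os
    import urllib.error
    import urllib.request
    from email.utils import formatdate

    def fetch_if_newer(url, local_path):
        """Return the file contents, hitting the network only if the
        server's copy is newer than the one on the CD-ROM."""
        request = urllib.request.Request(url)
        if os.path.exists(local_path):
            # Tell the server when our local copy dates from; it answers
            # 304 Not Modified if nothing has changed since then.
            mtime = os.path.getmtime(local_path)
            request.add_header("If-Modified-Since", formatdate(mtime, usegmt=True))
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()  # changed since the CD-ROM was made
        except urllib.error.HTTPError as err:
            if err.code == 304:
                with open(local_path, "rb") as fh:
                    return fh.read()  # the CD-ROM copy is still current
            raise

Exactly the check the browser was doing back then, just with the post office doing the cache-fill.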

Yield and harvest as a way of thinking about fault-tolerance

I was put onto an excellent article by Coda Hale titled ‘You Can’t Sacrifice Partition Tolerance’, which expands on some of the ideas laid out in Dr Eric Brewer’s CAP Theorem.

The CAP Theorem is the distributed systems version of the classic engineering maxim “Good, fast, cheap: pick two”. Briefly stated, it declares that a distributed system can provide at most two of the following three properties: consistency, availability, and partition tolerance. It makes perfect sense when you think about it: it’s logically impossible to remain both consistent and available once the system becomes partitioned by node failures, disconnections and the like.

The case raised by Hale is that, for any distributed system, partition tolerance isn’t something you get to choose. Disconnections, node failures and so on are guaranteed to happen at some point; they’re simply an annoyingly constant fact of life for anyone involved in the field. Any real-world distributed system must therefore tolerate partitions, otherwise it’s useless.

This leaves the distributed systems designer to decide the following: given that failures will occur and your system may become partitioned, will it degrade by sacrificing consistency, or availability? Which to choose depends on the system’s requirements, but it’s the unfortunate fact that you simply can’t preserve both.
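
To make the forced choice concrete, here’s a toy sketch (my own illustration, not from Hale’s article): two replicas of a key-value store with a simulated partition. In “CP” mode it refuses requests while partitioned, staying consistent but unavailable; in “AP” mode it keeps answering, staying available but possibly serving stale data.

    class ToyStore:
        """Toy only: a 'replica' is a dict and a partition is a flag,
        but the trade-off is the one CAP describes."""

        def __init__(self, mode):
            self.mode = mode       # "CP" or "AP"
            self.primary = {}
            self.replica = {}      # cut off while partitioned
            self.partitioned = False

        def write(self, key, value):
            if self.partitioned:
                if self.mode == "CP":
                    # Consistent but unavailable: refuse rather than diverge.
                    raise RuntimeError("unavailable during partition")
                # Available but inconsistent: the replica silently goes stale.
                self.primary[key] = value
                return
            self.primary[key] = value
            self.replica[key] = value

        def read_replica(self, key):
            if self.partitioned and self.mode == "CP":
                raise RuntimeError("unavailable during partition")
            return self.replica.get(key)  # may be stale in AP mode

    store = ToyStore(mode="AP")
    store.write("x", 1)
    store.partitioned = True
    store.write("x", 2)             # accepted, because we chose availability
    print(store.read_replica("x"))  # prints 1: available, but inconsistent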

Given this, Hale suggests that a better way to think about handling failure is in terms of yield and harvest. Yield is the fraction of incoming requests that are answered successfully (related to, but distinct from, uptime), and harvest is a measure of how much of the required data was available to fulfil a request (i.e. if a request needs data from two servers, one of which is unavailable, the harvest will be only 50%).

Hale’s point is that you can choose which of these will suffer in the case of failures and partitions. Is it better to drop some requests in order to preserve accuracy (protecting harvest at the cost of yield), or to answer every request with whatever data is reachable (protecting yield at the cost of harvest)? Again, the system requirements determine the answer.
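
A little sketch of that choice (again my own illustration, with made-up shard names and data): a query fans out to two shards, and a single flag decides whether a dead shard fails the whole request, sacrificing yield, or produces a partial answer, sacrificing harvest.

    SHARDS = {
        "shard-a": {"alice": 3, "bob": 7},
        "shard-b": None,  # simulate a failed or partitioned shard
    }

    def fan_out_query(fail_on_partial):
        reachable = {name: data for name, data in SHARDS.items() if data is not None}
        harvest = len(reachable) / len(SHARDS)
        if fail_on_partial and harvest < 1.0:
            # Preserve harvest, sacrifice yield: drop the request outright.
            raise RuntimeError("shard down; refusing to serve a partial answer")
        # Preserve yield, sacrifice harvest: answer with whatever is reachable.
        merged = {k: v for data in reachable.values() for k, v in data.items()}
        return merged, harvest

    answer, harvest = fan_out_query(fail_on_partial=False)
    print(answer, f"(harvest: {harvest:.0%})")  # a partial answer at 50% harvest

Over many requests, the strict version keeps harvest at 100% while its yield falls; the lenient version keeps yield up while its harvest drops.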

I find this a very useful way to think about how a system is going to degrade when faults occur. After all, if you design something to be scalable (increasing throughput as nodes are added), it makes perfect sense to consider how it will degrade and scale down as nodes are lost. It can only be beneficial to be completely clear on the choices and trade-offs involved, and on the fact that, ultimately, you can’t have your cake and eat it too when it comes to consistency and availability.