MIT Distributed Systems Pt. 1: MapReduce and VM Fault Tolerance
I am working through the MIT distributed systems labs, and I wanted to share my experience and post about some of the cool things I've been learning. I really appreciate courses that post all their materials online, so out of respect for the course, I won't be sharing any of my code. I figure this will take up a big chunk of my time, however, so I want to log my journey through the course. The course is in Golang, so it will be nice to solidify my Golang skills.
Test Cases as Feedback
One really useful benefit of the available labs is that they provide a pretty good set of test cases to run your implementations against. I think writing comprehensive test suites is a very useful skill to have, but when you're pressed for time and busy with other things, it is nice to be able to focus on the learning and the implementation; the tests are a free feedback mechanism in lieu of writing your own tests, having a teacher look over your assignment, and so forth. And they are a really useful one – I'm still debugging my MapReduce lab, but the tests helped me narrow down a really crucial bug that I might have missed otherwise.
MapReduce Lab
The first lab was to implement MapReduce in Golang. At first I was a bit hesitant – MapReduce seems outdated, and hardly anyone uses it directly anymore. But there's a good reason they keep it as lab 1, even 21 years after the MapReduce paper was published and the system itself has seemingly become legacy software: it's a great warm-up for learning about distributed systems in Go. The paradigm is so simple that it's quite easy to learn a basic implementation of it inside and out. The paper even includes, in an appendix, a complete word-count program written against Google's original implementation, which I thought was pretty neat.
The first real benefit was getting acquainted with Remote Procedure Calls (RPCs) in Go. I hadn't used them before, but I got comfortable with Go's net/rpc package quickly, because Go makes it very simple. It also abstracts away a lot of the networking going on under the hood, which makes it a very useful abstraction for distributed programming: either the RPC call went through or it failed.
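To give a flavor of it, here's a minimal sketch – not the lab's actual code, and the Coordinator type, method name, and argument structs are all made up for illustration. The server registers an exported method, and the client invokes it by name and gets back either a filled-in reply or an error:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/rpc"
)

// Hypothetical request/reply types; net/rpc requires them to be exported.
type TaskArgs struct{ WorkerID int }
type TaskReply struct{ Filename string }

// Coordinator's method follows net/rpc's signature rules: exported method,
// pointer arguments, and an error return.
type Coordinator struct{}

func (c *Coordinator) AssignTask(args *TaskArgs, reply *TaskReply) error {
	reply.Filename = fmt.Sprintf("mr-input-%d.txt", args.WorkerID)
	return nil
}

func main() {
	// Server side: register the service and serve RPCs over HTTP.
	rpc.Register(new(Coordinator))
	rpc.HandleHTTP()
	l, err := net.Listen("tcp", "127.0.0.1:1234")
	if err != nil {
		log.Fatal(err)
	}
	go http.Serve(l, nil)

	// Client side: the call either succeeds with a reply or returns an error.
	client, err := rpc.DialHTTP("tcp", "127.0.0.1:1234")
	if err != nil {
		log.Fatal(err)
	}
	var reply TaskReply
	if err := client.Call("Coordinator.AssignTask", &TaskArgs{WorkerID: 7}, &reply); err != nil {
		log.Fatal(err)
	}
	fmt.Println("assigned:", reply.Filename)
}
```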
I still have one bug left in the MapReduce lab, and I think I know what it is. The crucial test in the suite is the final one, the crash test. Sometimes it passes and sometimes it doesn't, thanks to the test's non-determinism. Basically, a bunch of workers are spun up, and each one either crashes, or stalls and then comes back into action. The latter case is the trickier one to handle, and it was my first brush with the challenges of working with distributed systems: if a worker stalls or is partitioned away, the coordinator can't tell that apart from a crash, so it may reschedule the worker's task. Now you have two processes working on the same task, and if the original worker never actually failed, they can step on each other's toes – for instance, by writing to the same output file. The lab gives a useful hint here, I realized: do the good ole systems trick of writing to temp files and performing an atomic rename when the operation is complete.
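Here's a sketch of that trick, assuming JSON-encoded key/value pairs – the encoding and file names are my assumptions, not the lab's spec:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

type KeyValue struct {
	Key   string
	Value string
}

// writeAtomically writes kvs to finalName without ever exposing a partial
// file. If this worker crashes mid-write, only a temp file is left behind;
// if a duplicate worker finishes first, the rename just replaces one
// complete file with another complete file.
func writeAtomically(finalName string, kvs []KeyValue) error {
	// Create the temp file in the same directory as the target, so the
	// rename stays on one filesystem (os.Rename is only atomic there).
	tmp, err := os.CreateTemp(".", "mr-tmp-*")
	if err != nil {
		return err
	}
	enc := json.NewEncoder(tmp)
	for _, kv := range kvs {
		if err := enc.Encode(&kv); err != nil {
			tmp.Close()
			os.Remove(tmp.Name())
			return err
		}
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return err
	}
	// The atomic step: readers see either the old file or the new one, whole.
	return os.Rename(tmp.Name(), finalName)
}

func main() {
	kvs := []KeyValue{{"hello", "1"}, {"world", "1"}}
	if err := writeAtomically("mr-out-0", kvs); err != nil {
		log.Fatal(err)
	}
}
```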
I'm looking forward to the next labs – in particular, I can't wait to get to the multi-part implementation of Raft! :-) Thanks MIT.
VM Fault Tolerance: State Replication vs. State Machine Replication
This was really cool. The lecture covers the fault-tolerant virtual machines paper. The basic idea is to provide fault tolerance for a single-core VM by having a backup that uses state machine replication to maintain an exact copy, plus a shared storage server with an atomic test-and-set flag that decides which machine goes live if the primary or backup fails or a network partition splits them – so you never end up with two primaries.
I learned about the distinction between state replication and state machine replication here. Imagine you have 16GB of memory that you want replicated between a primary and a backup. One way to do this is to transfer all 16GB to the backup at checkpoints, copying over all the data. This can be very expensive when a lot of data has to move – imagine you only modified a tiny fraction of it, yet you still ship the whole 16GB over the network.
The other method is state machine replication. You can look at the primary and backup as starting in the same initial state and applying the same sequence of operations to remain in lockstep. If the operations are deterministic, then you only need to send the operations themselves to the backup, which replays them. If your state machine has only deterministic operations, that's about it. If there are some non-deterministic steps – reading the current time, for example – then their results have to be explicitly transferred as data.
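Here's a toy sketch of the idea (mine, not the paper's): two copies of a deterministic state machine fed the same operation log end up in identical states, and only the tiny Op values ever need to cross the network:

```go
package main

import "fmt"

// Op is a deterministic operation; shipping Ops is far cheaper than
// shipping the whole state.
type Op struct {
	Key   string
	Value int
}

// Machine is a state machine: same initial state + same op sequence
// always yields the same final state.
type Machine struct {
	state map[string]int
}

func NewMachine() *Machine { return &Machine{state: map[string]int{}} }

func (m *Machine) Apply(op Op) { m.state[op.Key] += op.Value }

func main() {
	primary, backup := NewMachine(), NewMachine()
	ops := []Op{{"x", 1}, {"y", 2}, {"x", 3}}
	for _, op := range ops {
		primary.Apply(op)
		backup.Apply(op) // in a real system, only this Op crosses the network
	}
	// Both machines end up identical: map[x:4 y:2]
	fmt.Println(primary.state)
	fmt.Println(backup.state)
}
```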
The idea of replicating an entire VM was really fascinating to me. The real "aha" moment when I was reading this paper and watching the lecture was realizing that you don't even need to send most instructions over the network – for deterministic sequences of code, the backup can just run from the state it's in and mimic the primary exactly. The only things that need to be sent over the network are the non-deterministic parts: hardware interrupts, client packets, and so on. The hypervisor on the primary records the exact instruction at which a hardware interrupt occurs and sends that over the network, so the backup VM can deliver the same interrupt at the same exact point in its own execution.
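Conceptually, each entry on that logging channel looks something like this – the field names are mine, not the paper's wire format:

```go
package main

import "fmt"

// LogEntry is a hypothetical shape for one logging-channel entry. The key
// point is that every non-deterministic event is tagged with the exact
// instruction count at which the backup must deliver it.
type LogEntry struct {
	InstrCount uint64 // instruction number on the primary when the event fired
	Kind       string // e.g. "timer-interrupt", "client-packet", "time-read"
	Data       []byte // event payload, replayed verbatim on the backup
}

func main() {
	// The primary saw a timer interrupt at instruction 1,000,000; the backup
	// runs freely until that instruction, then injects the same interrupt.
	e := LogEntry{InstrCount: 1_000_000, Kind: "timer-interrupt"}
	fmt.Printf("deliver %s at instruction %d\n", e.Kind, e.InstrCount)
}
```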
One interesting downside of the original paper is the lack of support for multiprocessor VMs. This can't be handled with the same technique, because with multiple cores the interleaving of instructions – in particular, of accesses to shared memory – is itself non-deterministic, and the hypervisor can't cheaply record it. Apparently support for this exists now, but it is outside the scope of this paper.