My insight after a decade in distributed systems

Ten years is plenty of time to accumulate generic career wisdom, which I share on my personal page with my (poorly written) consolidated writings, but here I want to focus on the key technical aspect of distributed systems. There are a lot of fundamentals to contend with, but I think the hardest to deal with is state. State is hard, and many people don’t want to deal with it (for very good reason).

Every distributed system that contends with state either has a single point of failure, has a broken protocol, or sits on top of paxos/raft. The big services that are useful (e.g. SQS, S3, Dynamo) are essentially data structures built on a replicated state machine and transaction logs. These services can be woven together to create a wide swath of products. Furthermore, these services are great businesses because there is a chasm of latency between the developer’s compute and the service, so there is always a long queue of things to add to a service. Once customers use your service, they always ask for things because leaving is exceptionally hard.

My emerging thesis is that having separate services for various data structures is exceptionally wasteful.

Suppose you build an exceptionally fast storage engine around your unique business needs and your data structure. From an efficiency perspective, you can pack a great deal of storage within a single machine and then put it within a raft cluster. Fantastic! But now reading and writing that data requires many investments. You’ll need a high throughput protocol to read and write data, encryption to protect user data (thanks NSA), host selection, shard management, health checking, metrics, capacity management, re-balancing, a public protocol (e.g. HTTP), authentication and more. The engineering is well defined, but expensive. For most businesses, this novelty is a liability as the opportunity cost is radically high.

This forces many developers into weaving services together, which means waste accumulates in networking overhead, CPU spent serializing and deserializing, and new failure modes to contend with. Most businesses just run with the elevated cost, moderate latency, and questionable reliability (torn writes). I believe we can do much better.

The first insight is born from the realization that the exceptional engineering costs come from everything that isn’t your bespoke data structure. With that in mind, why can’t we have a single distributed system to handle everything? Well, we have had many attempts, and your typical relational database fits the bill for a wide number of use-cases. The existence of so many services and competing databases around different patterns implies a deficiency on some dimensions. However, if you care about durability, then they all boil down to a replicated state machine using paxos/raft as a log.

My insight is to wonder: why can’t we have a virtual machine target a replicated state machine? Why can’t I ship my novel data structure, describe how to mutate it or query it, and then get everything else for free? If I were to go back to academia, then this would be my research area. The reason is that I believe answering this question is going to take decades, but the tasty fruit is efficient computation, low latency, and exceptional reliability.

The fun thing about asking a “can’t” question is that it spawns answers which you can contend with, so let’s go into a few of them. One reason you can’t have a generic VM is that the way we think about VM state today is chunky and large. A typical VM has a blob of flat memory which the host hands over to the guest. First, this means that changes to the memory are not cheaply captured to a log. Second, the VM state is not easily upgradable between deployments. These answers raise the question of whether there is a new type of VM which is (a) differential for efficient replication, and (b) upgradable between logic deployments.

As a case study, we can look at Durable Objects. Here, Cloudflare is providing a mechanism for a worker to bring an object local to the VM. The differential is direct via an API, and the object’s state is basically a key-value map which provides durable state between logic upgrades. This is a great market offering, and I’m curious about the feedback that Cloudflare has received about it. I’m a fan, but I may be biased as I designed a similar service for Meta. A key problem is that worker state held exclusively within the VM is at risk, so there is a developer ergonomics trade-off between persisting state and running with scissors. The data structure behind the Durable Object API is basically a KKV system: key (object id) → (key → value).
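To make the KKV shape concrete, here is a minimal in-memory sketch. It is not Cloudflare's API; the `KKVStore` class and its method names are hypothetical, and durability and replication are left out entirely. It only shows the two-level key structure described above.

```python
from typing import Any, Dict


class KKVStore:
    """Hypothetical model: object id -> (key -> value). Not a real API."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[str, Any]] = {}

    def put(self, object_id: str, key: str, value: Any) -> None:
        # The first key selects the object, the second key selects the field.
        self._objects.setdefault(object_id, {})[key] = value

    def get(self, object_id: str, key: str, default: Any = None) -> Any:
        return self._objects.get(object_id, {}).get(key, default)

    def delete(self, object_id: str, key: str) -> bool:
        obj = self._objects.get(object_id)
        if obj is None or key not in obj:
            return False
        del obj[key]
        if not obj:
            # Drop empty objects so the outer map does not grow forever.
            del self._objects[object_id]
        return True


store = KKVStore()
store.put("room-42", "topic", "raft")
store.put("room-42", "members", 3)
store.put("room-7", "topic", "paxos")
```

Anything a worker keeps outside such a store (a local variable, for instance) is the "running with scissors" state that disappears with the VM.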

If you want a different data structure, then we can return to the question of building a VM that targets a replicated differential state machine. The new “differential” keyword addresses both the efficient replication and the upgradability critiques. At first, it is not obvious that upgradability follows from differentiability; the proof is to realize that to produce a difference, you read the current state, execute the logic, then emit a difference. Since the logic itself is stateless, it can change between deployments.
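The read, execute, emit loop can be sketched in a few lines. This is an illustration under stated assumptions rather than how any real system does it: the state is a flat dictionary, a delta maps keys to new values (with `None` meaning delete), a Python list stands in for the replicated log, and all names (`DifferentialMachine`, `counter_v1`, `counter_v2`) are made up for the example.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Delta = Dict[str, Any]  # key -> new value; None means "delete the key"
Logic = Callable[[State, Dict[str, Any]], Delta]


def apply_delta(state: State, delta: Delta) -> State:
    """Return a new state with the delta merged in."""
    merged = dict(state)
    for key, value in delta.items():
        if value is None:
            merged.pop(key, None)
        else:
            merged[key] = value
    return merged


class DifferentialMachine:
    def __init__(self, logic: Logic) -> None:
        self.logic = logic           # stateless: a pure function
        self.state: State = {}
        self.log: List[Delta] = []   # stand-in for the replicated log

    def handle(self, message: Dict[str, Any]) -> Delta:
        # Read the current state, execute the logic, emit a difference.
        delta = self.logic(self.state, message)
        self.log.append(delta)
        self.state = apply_delta(self.state, delta)
        return delta

    def upgrade(self, logic: Logic) -> None:
        # Only the function is swapped; state and log are untouched.
        self.logic = logic


def replay(log: List[Delta]) -> State:
    """Rebuild state from the log alone, with no logic involved."""
    state: State = {}
    for delta in log:
        state = apply_delta(state, delta)
    return state


def counter_v1(state: State, message: Dict[str, Any]) -> Delta:
    return {"count": state.get("count", 0) + message["by"]}


def counter_v2(state: State, message: Dict[str, Any]) -> Delta:
    # A later deployment that also records the last increment.
    return {"count": state.get("count", 0) + message["by"], "last_by": message["by"]}


machine = DifferentialMachine(counter_v1)
machine.handle({"by": 2})
machine.upgrade(counter_v2)
machine.handle({"by": 3})
replica_state = replay(machine.log)
```

Because a replica only ever applies deltas, it never needs to know which version of the logic produced them.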

Answering this question of why we can’t build a replicated differential state machine to produce arbitrary distributed systems requires the experience to know that every state machine induces second order state machines. For instance, a raft cluster for a key value pair is going to get slaughtered on read load at web scale. Most distributed systems will detect heat, cache a copy, and then serve the thundering herd from the cache. This creates the exceptionally hard cache invalidation problem, and solving this requires inferences around how data mutations affect the cache. For many well defined domains, this is a solved problem.

For an arbitrary state machine, the second order read load can again be solved by the differentiability aspect. We build the same heat detector to detect when a state machine is getting hot, and instead of replicating a static copy, we can replicate a live copy using a stream. This comes at a greater expense since streams are non-trivial, but the cache can serve relatively recent data without novel engineering per data structure.
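As a rough sketch of that idea, the primary below counts reads, and once a threshold is crossed it seeds a replica with a snapshot and forwards every later delta to it. Everything here is an assumption made for illustration: the names `Primary` and `LiveReplica`, the crude read counter (a real detector would more likely use a time-decayed rate), and the in-process method call standing in for a network stream.

```python
from typing import Any, Dict, List

Delta = Dict[str, Any]  # key -> new value; None deletes the key


def apply_delta(state: Dict[str, Any], delta: Delta) -> None:
    """Merge a delta into the state in place."""
    for key, value in delta.items():
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value


class LiveReplica:
    """A cache that stays fresh by consuming the delta stream."""

    def __init__(self, snapshot: Dict[str, Any]) -> None:
        self.state = dict(snapshot)

    def on_delta(self, delta: Delta) -> None:
        apply_delta(self.state, delta)


class Primary:
    def __init__(self, hot_threshold: int) -> None:
        self.state: Dict[str, Any] = {}
        self.hot_threshold = hot_threshold
        self.read_count = 0
        self.replicas: List[LiveReplica] = []

    def write(self, delta: Delta) -> None:
        apply_delta(self.state, delta)
        # The same delta that goes to the log is streamed to live replicas.
        for replica in self.replicas:
            replica.on_delta(delta)

    def read(self, key: str) -> Any:
        self.read_count += 1
        if self.read_count >= self.hot_threshold and not self.replicas:
            # Heat detected: seed a replica from a snapshot of current state.
            self.replicas.append(LiveReplica(self.state))
        source = self.replicas[0] if self.replicas else self
        return source.state.get(key)


primary = Primary(hot_threshold=3)
primary.write({"title": "a"})
for _ in range(3):
    primary.read("title")
primary.write({"title": "b", "views": 1})
latest_title = primary.read("title")
```

Nothing in the replica is specific to the data structure being served, which is the point: invalidation falls out of applying the stream.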

For a case study, we can look at Litestream, which fly.io has picked up. Here, the WAL (write-ahead log) is used to keep replicas in sync. The benefit is that reads of the database can be local to the logic, which is bloody fast and in alignment with my thesis.

Hopefully, it is clear that we are on a better path towards a brighter future where things are faster, more reliable, and cheaper. It just requires work, and here I am working on Adama. So, let’s get to the sales pitch since this blog is a content funnel.

If your novel data structure can be persisted as a JSON document, then Adama is an existence proof of a differential state machine which can feed a raft cluster. There is a bunch of work to get to the final point in my vision which is why early access is crucial for honesty.
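To give a feel for what a difference between two JSON documents can look like, here is a small sketch in the style of JSON Merge Patch (RFC 7396). It is not taken from Adama's implementation; `make_patch` and `apply_patch` are hypothetical helper names, and the approach inherits the known limits of merge patches: `null` means delete, so a literal null value cannot be stored, and arrays are replaced wholesale.

```python
import json
from typing import Any


def make_patch(before: Any, after: Any) -> Any:
    """Compute a merge-patch-style difference that turns before into after."""
    if not (isinstance(before, dict) and isinstance(after, dict)):
        return after  # scalars and arrays are replaced as a whole
    patch = {}
    for key in before:
        if key not in after:
            patch[key] = None  # None (JSON null) marks a deletion
    for key, value in after.items():
        if key not in before:
            patch[key] = value
        elif before[key] != value:
            patch[key] = make_patch(before[key], value)
    return patch


def apply_patch(target: Any, patch: Any) -> Any:
    """Apply a merge-patch-style difference and return the new document."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = apply_patch(result.get(key), value)
    return result


before = {"title": "draft", "players": {"alice": {"score": 1}, "bob": {"score": 2}}}
after = {"title": "draft", "players": {"alice": {"score": 5}}, "round": 2}

patch = make_patch(before, after)
wire_bytes = len(json.dumps(patch))  # only the change goes over the wire
restored = apply_patch(before, patch)
```

The patch for this example touches only `players` and `round`; the unchanged `title` never appears in it.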

Here, Adama has a new language which allows you to lay out your state within a document schema, describe how the state mutates via message passing, and connect computation to your state through reactivity. Adama is in early access because there is a lot of novelty being explored which requires building a bunch of things from the ground up. If you find this interesting, then please drop into the discord chat and say hello. You can get started today, but there are some rough edges.

May 31st, 2022, by Jeffrey M. Barber