The path towards reliability (and scale) for great sleep
January 1st, 2022
By Jeffrey M. Barber

Since I have committed to building an open source SaaS, I am designing how to handle the failure modes because I like sleep. I’ll have much to say about the role of design in great sleep, but today I’m looking at the path towards launch and asking whether there is an opportunity to offer trade-offs to the future consumer as this unfolds, along with the related business model.

For clarity, each diagram here represents a single cluster within a single region under a single DNS name. I’m not sure if I want to approach multi-region design; however, I am considering what it would mean for multi-AZ design.

Ultra YOLO

We start with the simplest and most problematic. Take the code and then run the WebSocket frontend with Adama directly coupled within the process.

[diagram: a single service]

Pros

  • Easy for me, a single command
  • Simple to reason about
  • Cheapest offering
  • Lowest possible latency

Cons

  • Too easy, I need to suffer more
  • Not durable
  • Unable to survive updates to either Web or Adama
  • Reliability depends entirely on the lifetime of the host
  • Scale is exceptionally limited to that single host
  • Yearly data loss chance: 100%
  • Easy to denial of service

Does this have legs? This could be used as an ephemeral real-time solution since it is just a state machine routable via DNS. The business here is to resell hosts on the edge, and this could work exceptionally well with something like fly.io.

Mega YOLO

We take the Ultra YOLO and split the process with a front-end web tier connecting to an Adama host via gRPC.

[diagram: a single service]
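To make the split concrete, here is a minimal sketch, under hypothetical names (this is not the actual Adama API), of the boundary the split introduces: the web tier stops calling Adama in-process and instead holds one stream per client connection against a remote Adama host, which is the shape a gRPC bidirectional stream carries naturally.

```java
// Hypothetical sketch of the Web-to-Adama boundary; the real gRPC service definition is not shown.
// The stream-per-connection shape mirrors the WebSocket session the web tier holds with the user.
public interface AdamaLink {
  /** Events pushed from the Adama host back to the web tier for one connected client. */
  interface Events {
    void onData(String deltaJson);   // a document delta to forward over the user's WebSocket
    void onError(int code);          // remote failures arrive as events rather than exceptions
    void onDisconnected();           // the Adama host went away (e.g. it was redeployed)
  }

  /** Open a stream for one (space, key, who) document connection on the remote Adama host. */
  Connection connect(String space, String key, String who, Events events);

  interface Connection {
    void send(String channel, String messageJson); // forward a client message into the document
    void close();                                  // the user hung up
  }
}
```

The important property of the split is that every call can now fail independently of the web tier, which is what buys the ability to update the Web instance on its own.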

Pros

  • Simple to reason about
  • Almost as cheap as Ultra, but the communication overhead reduces total capacity
  • Still pretty quick
  • Provides a natural pivot to double the scale of Ultra YOLO

Cons

  • Not durable
  • Unable to survive updates to Adama
  • Reliability depends entirely on the lifetime of the host
  • Scale is exceptionally limited to that single host
  • Yearly data loss chance: 100%
  • Easy to denial of service

Does this have legs? See Ultra YOLO; this is similar, with minor extra cost and additional latency, where the key gain is the ability to update the Web instance independently of Adama.

Since the key failure mode would be Adama deployments, this raises the question of whether an ephemeral real-time solution could benefit from a state hand-off protocol where a new service is stood up and then takes over the state. This would bring the yearly data loss chance from 100% down to 5%, barring operator error. The problem with this line of thinking is that it demands perfect operations.
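The hand-off itself is easy to sketch; the hard part is that every step has to happen in order, every time, which is exactly the operational burden. Here is a rough outline under hypothetical names (none of this exists today):

```java
// A sketch of a state hand-off for the ephemeral (no-database) tiers: stand the replacement up,
// drain document state into it, repoint clients, then retire the old process. All names invented.
public final class HandOffSketch {
  /** What either side of the hand-off would need to expose. */
  public interface AdamaProcess {
    void awaitHealthy();
    void stopAcceptingNewConnections();
    Iterable<String> snapshotAll();            // serialized state for every loaded document
    void restore(String documentSnapshot);
    void redirectClientsTo(String address);
    String address();
    void shutdown();
  }

  public static void handOff(AdamaProcess outgoing, AdamaProcess incoming) {
    incoming.awaitHealthy();                        // 1. the replacement must be up first
    outgoing.stopAcceptingNewConnections();         // 2. quiesce the old host
    for (String snapshot : outgoing.snapshotAll()) {
      incoming.restore(snapshot);                   // 3. drain state across
    }
    outgoing.redirectClientsTo(incoming.address()); // 4. clients reconnect to the new host
    outgoing.shutdown();                            // 5. only now retire the old process
  }
}
```

Skip a step, run two hand-offs at once, or crash mid-drain, and the 100% data loss mode is right back; that is why this line of thinking demands perfect operations.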

YOLO

We take the Mega YOLO and introduce a database into the mix. I’m currently using MySQL because it’s old as dirt and reliable enough.

[diagram: a single service]

Pros

  • Simple to reason about
  • Best combination of latency and pretend-responsibility
  • Able to survive software updates

Cons

  • Not long term durable
  • Reliability depends entirely on the lifetime of the host
  • Scale is exceptionally limited to that single host
  • Yearly data loss chance: 5%
  • Easy to denial of service

Does this have legs? It all depends on latency. The key problem is that this fakes quality with a submarine lack of durability. This will also depend on the domain, so this could be a legit option to offer people. This would be great as a small developer offering for testing.

Durable

Replace managing a database yourself with an existing DB-as-a-Service (e.g. RDS) which handles everything for you. The key requirement of the DBaaS is that it provides transactions within a shard holding a document’s log. The transaction is an atomic append, and it could be replaced with a Raft-based logger as well.

[diagram: a single service]
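As a sketch of that requirement, assume a hypothetical deltas table keyed by (space, doc, seq); the atomic append is then just an insert where the primary key makes the append conditional. The real schema and storage layer may look different.

```java
// A sketch of the atomic append the DBaaS must provide; the table and all names are assumptions.
// Assumed schema: CREATE TABLE deltas (space VARCHAR(64), doc VARCHAR(64), seq BIGINT,
//                                      delta TEXT, PRIMARY KEY (space, doc, seq))
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public final class DocumentLog {
  /** Append a delta at a specific sequence number; returns false if another writer got there first. */
  public static boolean append(Connection db, String space, String doc, long seq, String delta)
      throws SQLException {
    try (PreparedStatement stmt = db.prepareStatement(
        "INSERT INTO deltas (space, doc, seq, delta) VALUES (?, ?, ?, ?)")) {
      stmt.setString(1, space);
      stmt.setString(2, doc);
      stmt.setLong(3, seq);
      stmt.setString(4, delta);
      stmt.executeUpdate();
      return true;   // the append won; this writer owns seq
    } catch (SQLIntegrityConstraintViolationException conflict) {
      return false;  // another writer already holds seq; the caller must catch up and retry
    }
  }
}
```

Since each document’s log lives within one shard, this single insert is the whole transaction.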

Pros

  • Still simple to reason about (until you build your own DBaaS)
  • Durable
  • Durability is handled by someone else
  • Able to survive software updates

Cons

  • Reliability depends entirely on the lifetime of the host
  • Scale is exceptionally limited to that single host
  • Easy to denial of service
  • Vendor lock-in to DBaaS

Does this have legs? At this point, this can look like “buy an Adama host” and let customers do sharding within their app (i.e. pick a DNS name). I feel this is a cheat, but this is the first offering that at least feels responsible.

Defense against the interwebs

We can mitigate denial of service attacks by introducing multiple web proxies to front-load Adama traffic.

[diagram: a single service]

Pros

  • Somewhat simple to reason about as this mirrors the Durable setup
  • Durable
  • Durability is handled by someone else
  • Able to survive software updates

Cons

  • Reliability depends entirely on the lifetime of the host
  • Scale is exceptionally limited to that single host
  • Vendor lock-in to DBaaS

Does this have legs? This extends the business of front-loading Adama’s traffic, and to some degree that business competes with Cloudflare for preventing denial of service attacks. This raises the key design question of whether customers should manage their own sharding logic.

Expensive Load Balancer

Or, we can build an expensive high-throughput load balancer which routes traffic to the right Adama shard. The key aspect is that the WebProxy is just shuffling bytes between the user and Adama, while Adama carries several burdens: managing the document in memory, transforming messages into storage deltas, computing privacy, and keeping clients up to date. Early tests on a $5/mo host indicate that memory is the key bottleneck, followed by CPU.

[diagram: a single service]

Pros

  • Simple to reason about
  • Durable
  • Durability is handled by someone else
  • Able to survive software updates
  • Scales Adama up dramatically due to sharding Adama
  • Reliability of Adama is managed by that expensive load balancer.

Cons

  • Scale and reliability both depend on the expensive load balancer, which is a single point of failure and choke point.
  • Easy to denial of service
  • Vendor lock-in to DBaaS

Does this have legs? This will dramatically increase scale, but the single point of failure becomes an over-promise. Fortunately, it is stateless, so recovery should be quick.

Condenser

We can combine the distributed web proxies with the expensive load balancer. We rename that single load balancer a condenser and optimize its protocol since it has exactly one job: routing streams. This radically simplifies the expensive load balancer, and we can then offload TLS, auth, and administration APIs to the WebProxy.

[diagram: a single service]

Pros

  • Durable
  • Durability is handled by someone else
  • Able to survive software updates
  • Scales Adama up dramatically due to sharding Adama
  • Reliability of Adama is managed by the condenser.
  • Compute offload to the WebProxy reduces condenser and Adama burden
  • Static sharding could greatly increase scale and mitigate catastrophic reliability issues.

Cons

  • Stack is getting longer with many pieces
  • Latency is suffering multiple hops
  • Scale and reliability both depend on the condenser, which is a single point of failure and choke point.
  • Vendor lock-in to DBaaS

Does this have legs? That single point of failure may be a deal-breaker for the unlucky customers bound to a dead condenser, but the initial web proxy fleet could be amortized across all customers, and a customer could have a dedicated condenser.

Distributed Load Balancer

Fixing the reliability of the choke point amounts to having a distributed load balancer between the web proxy and Adama tiers. This introduces an exceptionally hard routing problem between the WebProxy and Adama tier. Ideally, a document lives on exactly one Adama host, but we should expect (and embrace) the possibility that two or more Adama hosts will have the same document loaded. This then forces the DBaaS to reject writes from the least up-to-date Adama host. When the not-so-up-to-date Adama host sees the failure, it requests a patch from the DBaaS to catch up, and it will then retry. This logic has been implemented, so progress is being made. However, it comes with the burden that the latency with which users learn of updates depends on how frequently each Adama host holding the document is active.

[diagram: a single service]
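Here is a sketch of that host-side behavior under hypothetical interfaces (not the actual implementation): when an append is rejected, pull the deltas this host missed, fold them into memory, and try again against the new head.

```java
// Sketch of how a not-so-up-to-date Adama host reacts to a rejected write; all names are invented.
public final class CatchUpAndRetry {
  /** What the storage layer offers: a conditional append plus a way to read what was missed. */
  public interface LogStore {
    boolean append(long seq, String delta);   // false means another host already wrote seq
    Iterable<String> deltasSince(long seq);   // every delta after seq that this host has not seen
  }

  /** What the in-memory document offers. */
  public interface LivingDocument {
    long currentSeq();               // the last sequence number folded into memory
    void patch(String delta);        // fold a delta in (advances currentSeq)
    String deltaFor(String message); // the delta this message produces at the current head
  }

  public static void ingest(LivingDocument doc, LogStore store, String message) {
    while (true) {
      String delta = doc.deltaFor(message);    // recompute against whatever head we now have
      long next = doc.currentSeq() + 1;
      if (store.append(next, delta)) {
        doc.patch(delta);                      // our write won; apply it locally and finish
        return;
      }
      for (String missed : store.deltasSince(doc.currentSeq())) {
        doc.patch(missed);                     // we were stale; catch up, then retry the message
      }
    }
  }
}
```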

Pros

  • Durable
  • Durability is handled by someone else
  • Able to survive software updates
  • Scales Adama up dramatically due to sharding Adama
  • Reliability of Adama is managed by consensus and fail-over
  • Compute offload to the WebProxy reduces Adama burden

Cons

  • Requires reliable failure detection
  • Requires consistent routing
  • Requires DBaaS to detect when routing is split
  • Stack requires many new things to do all this stuff
  • Latency will have more variability due to the failures induced
  • Vendor lock-in to DBaaS

Does this have legs? This is potentially the best case if failure detection can be brought in cheaply enough. The key gap to cross this time is handling the failure mode of the DBaaS rejecting a write due to a conflict emerging from a split brain. There are a variety of fixes for this which will be discussed in a later post.

This is the model that I intend to launch with since I can leverage a gossip-style failure detector to detect when capacity is added/removed and when hosts fade into the night. Since the goal of gossip is to achieve a steady state, we can then leverage rendezvous hashing to bind a document to a host. The trick is that the binding must be reactive such that association changes manifest in the WebProxy changing the binding. This forces retry logic to live within the WebProxy.
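As a sketch of that binding (not the production code), rendezvous hashing scores every live host against the document key and picks the highest score; when gossip changes the live set, the winner for some keys changes, and that change is exactly what the WebProxy must react to.

```java
// Rendezvous (highest-random-weight) hashing sketch for binding a document key to a live host.
// MD5 is used only because it mixes well; any stable hash would do.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;

public final class RendezvousPicker {
  /** Pick the host with the highest score for this key; returns null if no hosts are alive. */
  public static String pick(String documentKey, Collection<String> liveHosts) {
    String winner = null;
    long best = Long.MIN_VALUE;
    for (String host : liveHosts) {
      long score = score(host, documentKey);
      if (winner == null || score > best) {
        best = score;
        winner = host;
      }
    }
    return winner;
  }

  private static long score(String host, String key) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest((host + "/" + key).getBytes(StandardCharsets.UTF_8));
      long value = 0;
      for (int i = 0; i < 8; i++) {
        value = (value << 8) | (digest[i] & 0xFF);  // fold the first 8 bytes into a comparable score
      }
      return value;
    } catch (NoSuchAlgorithmException impossible) {
      throw new RuntimeException(impossible);
    }
  }
}
```

The nice property over plain modulo hashing is that when a host fades away, only the documents bound to that host move; everything else keeps its binding.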

Distributed Load Balancer plus Sharded Database

The latency of the system will depend on the DBaaS, so we can take over the DB role ourselves, assuming we build out the database with replicas. This will influence the design of how the WebProxy tier chooses an Adama host, since it doesn’t make sense to have multiple full socket meshes, especially with most database clients requiring many threads to achieve concurrency.

[diagram: a single service]

Pros

  • Durable
  • Able to survive software updates
  • Scales Adama up dramatically due to sharding Adama
  • Reliability of Adama is managed by consensus and fail-over
  • Compute offload to the WebProxy reduces Adama burden
  • No longer vendor locked

Cons

  • Requires reliable failure detection
  • Requires consistent routing
  • Requires the database to detect when routing is split
  • Stack requires many new things to do all this stuff
  • Operational burden for managing databases

Does this have legs? As of yet, I have not produced a solution that I like beyond using a DBaaS, but I suspect this will provide guidance on the protocol between the WebSocket tier and Adama for how to do routing. Ideally, I would probably just use MySQL Router or existing cluster software since this is a giant pain in the ass. The hard part of managing databases is the migration of data.

Distributed Load Balancer with Chain Replication

Let’s go wild! If the DB schema remains simple, then I could remove all the overhead and complexity of the DB, manage storage using the local file system, and build my own replication scheme.

I’d start with chain replication, but there are trade-offs when compared to Raft/Paxos. The reason I’d bias towards chain replication is that I would invent a bi-directional logger and leverage prediction.

There’s an interesting thing that happens when you think about having compute and replication live together on a chain. In one direction, you durably persist messages in a queue, and this queue can then be used to flow-control writes from clients. Once a message arrives at the tail of the chain, it can be converted into a data delta and sent back up the chain. The reverse direction atomically deletes from the queue and updates state.

This has the potential to split the CPU burden between the head (for privacy computations and viewer updates) and the tail (doing work). This split allows you to optimistically update clients because the head can have a queue of messages which can be evaluated locally to turn into data updates on clients and then reverted. The state will be reconciled once durably persisted.

[diagram: a single service]
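Here is a toy, in-memory sketch of that bi-directional flow (hypothetical names, single process) where the call stack stands in for the chain links: the message is queued on the way down, converted to a delta at the tail, and retired from each node’s queue as the delta flows back up.

```java
// Toy sketch of the bi-directional chain idea; a real chain would replicate across hosts and
// persist both the queue and the state durably. The recursive call models the forward direction
// and the return path models the reverse direction.
import java.util.ArrayDeque;
import java.util.Deque;

public final class ChainNodeSketch {
  private final Deque<String> pending = new ArrayDeque<>(); // the durable message queue, in reality
  private String state = "{}";                              // the document state this node holds
  private final ChainNodeSketch next;                       // towards the tail; null at the tail

  public ChainNodeSketch(ChainNodeSketch next) {
    this.next = next;
  }

  /** Forward: persist the message and pass it towards the tail. Reverse (on the way back):
   *  retire the queue entry and fold the computed delta into local state. */
  public String forward(String message) {
    pending.addLast(message);            // the head can use queue depth to flow-control clients
    String delta = (next != null)
        ? next.forward(message)          // not the tail: keep flowing towards the tail
        : compute(message);              // tail: do the work and turn around
    pending.removeFirst();               // atomic with the state update in a real logger
    state = apply(state, delta);
    return delta;                        // keep flowing back towards the head
  }

  // trivial stand-ins for the real document logic
  private String compute(String message) { return "{\"last\":\"" + message + "\"}"; }
  private String apply(String current, String delta) { return delta; }

  public static void main(String[] args) {
    ChainNodeSketch tail = new ChainNodeSketch(null);
    ChainNodeSketch middle = new ChainNodeSketch(tail);
    ChainNodeSketch head = new ChainNodeSketch(middle);
    System.out.println(head.forward("hello")); // every node ends up holding the same state
  }
}
```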

Pros

  • Durable
  • Able to survive software updates
  • Scales Adama up dramatically due to sharding Adama
  • Reliability of Adama is managed by consensus and fail-over
  • Compute offload to the WebProxy reduces Adama burden
  • No longer vendor locked

Cons

  • Super complex
  • Expensive on my time
  • Requires reliable failure detection
  • Requires a service to manage the chains

Does this have legs? Maybe as a research project. Right now, it’s just too much to even think about when I can just use RDS. I need to measure a bunch of things before I go off and build a new logger from scratch.

Next steps

For developers using the open source package locally, a “YOLO” box makes a ton of sense. Just configure it against a single MySQL server, and then boom.

For production, I’m thinking about how to configure Adama with the scripts and how deployments work. I’ve committed to building a failure detector using gossip since gossip is so cool. I’ll post updates on the design of how routing works.