From Single Process to Global Cluster

Scaling Real-Time Infrastructure Honestly

I started building Adama as a single process on a $5/month VPS. Documents lived in memory, state was volatile, and a power blip meant total data loss. That was fine for a prototype. Then people started using it, and I had to figure out how to make it production-grade without lying to myself about the tradeoffs.

This is the story of 10 topology levels I designed on the path from "Ultra YOLO" to clustered multi-region deployment, the gossip failure detector I built from scratch, the routing problem I solved with rendezvous hashing, and the multi-region mess I created and then (partially) cleaned up. Every level has a cost. I will tell you what those costs are.

The 10 Topology Levels

I designed these as a progression, each solving one problem from the previous level while introducing new ones. The progression is instructive because it maps to how most real-time systems evolve in practice.

Level 1: Ultra YOLO. Single process. WebSocket frontend coupled directly with the document runtime. One command to start. Lowest possible latency. Yearly data loss chance: 100%. This is not a joke -- it is literally ephemeral. But for local development and throw-away real-time experiences, it is the right answer.

Level 2: Mega YOLO. Split into frontend (Web) and backend (Adama) processes connected via internal protocol. Gains the ability to update the web tier independently. Still not durable. Still a single host. Still 100% yearly data loss.

Level 3: YOLO. Add MySQL. Now software updates do not destroy data. Yearly data loss drops to ~5% (matching server failure rates). The cheapest offering that feels "responsible." Good for developer testing environments.

Level 4: Durable. Replace self-managed MySQL with a Database-as-a-Service (RDS). Durability is now someone else's problem. This is the first level where I would let a paying customer's data live.

Level 5: Defense. Add multiple web proxies in front to absorb traffic. Mitigates denial of service. The Adama host is still a single point of failure.

Level 6: Expensive Load Balancer. Shard documents across multiple Adama hosts behind a single stateful load balancer. Dramatically increases capacity since memory is the primary bottleneck. But the load balancer itself is a single point of failure and choke point.

Level 7: Condenser. Rename the load balancer to "condenser" -- it has one job: routing streams. Offload TLS, auth, and administrative APIs to the WebProxy tier. Reduces the condenser's burden, but it is still a single point of failure.

Level 8: Distributed Load Balancer. This is where it gets hard. Remove the single condenser. Every WebProxy can route to any Adama host. This requires solving three problems simultaneously: reliable failure detection, consistent routing, and split-brain handling at the storage layer.

Level 9: Sharded Database. Take over the database tier with replicas co-located with Adama hosts. Removes vendor lock-in. Adds operational burden.

Level 10: Chain Replication. Replace the database entirely with local filesystem and custom replication. Head of chain handles privacy computations and viewer updates. Tail handles durable state. This has theoretical advantages for splitting CPU burden but is too complex for one person to build and operate.

I launched at Level 8. It was the minimum viable topology that I could honestly call "production" without lying about reliability guarantees.

Production Topology (Level 8): Three-Tier Architecture

  - Web proxy tier (Proxy A through Proxy N): TLS termination, auth, admin APIs, and routing decisions via gossip-informed rendezvous hashing. The proxies are stateless, so they are easy to scale.
  - Adama host tier (Host 1 through Host N): document VMs in memory, message processing, reactive delta sync. Each document lives on exactly one host.
  - Storage tier (DBaaS / Caravan): write-ahead log, snapshots, and conflict detection for split-brain. Storage rejects stale writes.

The Gossip Failure Detector

Level 8 requires reliable failure detection. When an Adama host dies, every WebProxy needs to know so they can reroute traffic. I built a gossip-based failure detector from scratch, loosely following the Cornell gossip failure detection paper.

The core idea: every server heartbeats itself at a regular interval (every second), incrementing a counter. This counter is replicated via gossip across the cluster. If server X's counter has not changed in 25 seconds, it is declared dead.

The math is favorable. For 1,000 hosts with gossip every second, recent counters spread to every host in roughly 10-11 rounds (log2(1000) ~ 10). That means maximum detection time is about 25 + 11 = 36 seconds. In practice, most failures are detected in under 15 seconds because you are gossiping updated counters continuously.
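As a sketch, the per-instance bookkeeping is small: a counter plus the local time it last moved. This is a minimal illustration of the scheme described above, not Adama's actual code; the class and constant names are mine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal counter-based failure detector: an instance is alive only if its
// gossip-replicated heartbeat counter has moved within the detection window.
public class FailureDetector {
  static final long DEAD_AFTER_MS = 25_000; // counter unchanged for 25s => dead

  static class InstanceHealth {
    long counter;        // monotonically increasing heartbeat counter (from gossip)
    long lastChangedMs;  // local time when we last saw the counter move
  }

  private final Map<String, InstanceHealth> instances = new ConcurrentHashMap<>();

  // Called when gossip delivers a (possibly newer) counter for an instance.
  public void observe(String id, long counter, long nowMs) {
    InstanceHealth h = instances.computeIfAbsent(id, k -> new InstanceHealth());
    if (counter > h.counter) {
      h.counter = counter;
      h.lastChangedMs = nowMs;
    }
  }

  // Alive if the counter moved within the window; stale counters re-delivered
  // by gossip do not refresh the clock, so a dead host stays dead.
  public boolean isAlive(String id, long nowMs) {
    InstanceHealth h = instances.get(id);
    return h != null && (nowMs - h.lastChangedMs) < DEAD_AFTER_MS;
  }
}
```

Note that `observe` ignores counters that are not strictly newer: gossip redelivers old values constantly, and only fresh heartbeats should reset the liveness clock.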

I designed the gossip protocol with an instance set chain -- a map of hashes to instance sets that acts as an implicit history of capacity changes. When two peers gossip, they first compare hashes. If they agree, they exchange only the mutable content (counters). If they disagree, they fall back to exchanging instance lists. The happy path (hash match) completes in 3 messages. The worst case (no hash match) requires sending everything, but this only happens during capacity changes.

Gossip Protocol: Failure Detection Flow

  - Happy path (common case): the initiator sends its hash plus recent instances and deletes; the remote peer finds the hash and replies with its counters and any missing instances; the initiator sends its counters back, and the exchange is done. Three messages, counters only, minimal bytes.
  - Slow path (rare): if the hash is NOT found (capacity changed), both sides exchange their full instance sets, which is expensive. This only happens during capacity changes; monitor it.
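The hash-then-counters exchange can be sketched in a few lines. Everything here (the class names, the hash choice, the positional counters-only payload) is an illustrative assumption, not Adama's wire protocol.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the hash-first exchange: when membership hashes match, peers ship
// counters only (ids are implied by the shared, sorted instance set); when
// hashes differ, they fall back to exchanging the full instance map.
public class GossipExchange {
  final TreeMap<String, Long> counters = new TreeMap<>(); // id -> heartbeat counter

  // Hash the instance ids only; the counters are the mutable content.
  String membershipHash() {
    return Integer.toHexString(counters.keySet().hashCode());
  }

  // Happy path payload: counters in sorted-id order, no ids on the wire.
  List<Long> countersOnly() {
    return new ArrayList<>(counters.values());
  }

  // Happy path receive: apply counters positionally against our sorted ids,
  // keeping the newest (maximum) counter per instance.
  void applyCountersOnly(List<Long> remote) {
    int i = 0;
    for (String id : counters.keySet()) {
      counters.merge(id, remote.get(i++), Math::max);
    }
  }

  // Slow path (hashes differ): merge the full instance map.
  void applyFull(Map<String, Long> remote) {
    remote.forEach((id, c) -> counters.merge(id, c, Math::max));
  }
}
```

The payoff is in the happy-path payload: because both sides agree on the instance set, the ids never need to be sent, only the counters.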

Deletion is the tricky part. An instance is considered dead if it has not been heard from in 25 seconds. It becomes eligible for deletion by peers after 20 seconds. The 5-second gap accounts for clock drift while still letting deletions propagate quickly. I am still tuning these values -- they directly impact availability SLA.

Rendezvous Hashing for Document Routing

Once gossip tells every WebProxy which Adama hosts are alive, the next problem is: given a document key (space, key), which host should own it?

I use rendezvous hashing (also called highest random weight hashing). For each document, compute a hash against every live Adama host. The host with the highest hash wins. When a host dies, only the documents assigned to that host need to move. When a new host joins, it only steals a fair share of documents from existing hosts.
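A minimal sketch of that per-document host selection follows, assuming a SHA-256-based score; the real hash function and method names are implementation details I am guessing at.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Rendezvous (highest-random-weight) hashing: score every live host against
// the document key; the highest score owns the document.
public class Rendezvous {
  // Score a (document, host) pair via the first 8 bytes of a SHA-256 digest.
  static long score(String space, String key, String host) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] d = md.digest((space + "/" + key + "@" + host).getBytes(StandardCharsets.UTF_8));
      long v = 0;
      for (int i = 0; i < 8; i++) {
        v = (v << 8) | (d[i] & 0xFF);
      }
      return v;
    } catch (NoSuchAlgorithmException impossible) {
      throw new RuntimeException(impossible); // SHA-256 is always available
    }
  }

  // Given the live hosts reported by gossip, pick the owner for a document.
  static String pickHost(String space, String key, List<String> liveHosts) {
    String best = null;
    long bestScore = Long.MIN_VALUE;
    for (String host : liveHosts) {
      long s = score(space, key, host);
      if (best == null || s > bestScore) {
        bestScore = s;
        best = host;
      }
    }
    return best;
  }
}
```

The key property: removing a host that is not the winner never changes the winner, so only documents owned by a dead host ever move.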

The binding must be reactive. When gossip detects a host failure, the routing table changes, and every WebProxy re-routes affected streams. The retry logic lives in the WebProxy tier -- the client never sees infrastructure failures as long as the document can be loaded on another host.

We expect (and embrace) the possibility that two Adama hosts temporarily have the same document loaded -- a split brain. The storage layer handles this by rejecting writes from the less up-to-date host. That host detects the rejection, fetches the latest state, and retries. This is not ideal for latency, but it is correct.
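A toy version of that stale-write rejection, assuming the storage layer tracks a per-document sequence number; the class names and the exact conflict rule are illustrative, not Adama's storage API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The storage tier tracks the last applied sequence number per document.
// A host that fell behind during a split brain gets its write rejected and
// must fetch the latest state and retry.
public class SeqStorage {
  enum WriteResult { APPLIED, REJECTED_STALE }

  private final Map<String, Integer> lastSeq = new ConcurrentHashMap<>();

  public synchronized WriteResult write(String docKey, int seq, String delta) {
    int current = lastSeq.getOrDefault(docKey, 0);
    if (seq != current + 1) {
      // Writer is behind (or duplicating): reject so it re-reads latest state.
      return WriteResult.REJECTED_STALE;
    }
    lastSeq.put(docKey, seq);
    // ... append delta to the write-ahead log here ...
    return WriteResult.APPLIED;
  }
}
```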

The Multi-Region Mess

I wanted Adama to run on the edge, close to users. The initial plan was simple: put machines in multiple regions, use a global database for routing, done.

The reality was a mess. My first design had the WebProxy querying a cross-region database on every connection. For a document warm in the local region, that added 150ms of unnecessary latency (the cross-region database round-trip). For a document in a remote region, connections took 462ms with multiple cross-region hops.

I went through several iterations:

  1. Move routing decisions to Adama hosts. Instead of WebProxy -> Database -> Adama, do WebProxy -> Adama (which knows if it has the document). Warm local connections dropped from 162ms to 12ms.

  2. Use gossip-informed guessing. WebProxy guesses which Adama host has the document using rendezvous hashing. For the 99%+ case where the guess is correct, latency is minimal. For the wrong guess, fall back to database lookup.

  3. Introduce findbind. An atomic database operation that finds a document's location and optimistically binds it to the requesting host if unbound. Cold document connections dropped from 310ms to 162ms.
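In memory, the findbind semantics from step 3 boil down to an atomic get-or-bind. This sketch uses a map's putIfAbsent to stand in for the conditional database write; the class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// findbind: atomically return a document's current location, binding it to
// the requesting host if it is unbound. In production this is a single
// conditional write against the routing database.
public class Directory {
  private final Map<String, String> binding = new ConcurrentHashMap<>();

  // Returns the host the document is bound to, binding it to the requester
  // if it was free. Exactly one of two racing requesters wins the bind.
  public String findbind(String docKey, String requestingHost) {
    String existing = binding.putIfAbsent(docKey, requestingHost);
    return existing != null ? existing : requestingHost;
  }

  // Called when a host unloads a document so another host can bind it.
  public void unbind(String docKey, String host) {
    binding.remove(docKey, host); // only removes if still bound to this host
  }
}
```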

The final architecture separates concerns clearly: control plane (authentication, deployments, space management) goes to the global region. Data plane (document connections, messaging, real-time sync) stays local. The practical effect is that establishing a connection might be slow (cross-region control plane hit), but interacting with a product is fast (5ms local reads and writes).

The honest tradeoff: if you connect to a document that is loaded in a remote region, your latency is the sum of local region latency plus inter-region latency. Documents are biased to the first person who loads them. For local businesses, this is great -- the owner loads the document in their region, and customers nearby get fast access. For a globally distributed group, someone is always far from the document.

What I Would Do Differently

If I started over, I would not have "just talked to the database" everywhere. That helped me iterate fast on the product, but it created a coupling nightmare when I needed to go multi-region. Every database call inside an event thread became a latency landmine.

I spent months converting synchronous database calls to async interfaces with callbacks. The pattern was straightforward but tedious: every blocking call like Domains.get(database, host) became an async call with a Callback that handles success and failure separately. This conversion is the boring, unglamorous work that makes multi-region possible. I knew I was making a mistake when I wrote the synchronous version. I paid the price later. You will too, if you start with blocking database calls in your event loop.
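The shape of that conversion, sketched with a simplified Callback interface; Domains and the signatures here are stand-ins for illustration, not the real API.

```java
import java.util.concurrent.Executor;

// Sync-to-async conversion: the blocking database read moves off the event
// thread, and the result comes back through a callback with separate
// success and failure paths.
public class AsyncConversion {
  interface Callback<T> {
    void success(T value);
    void failure(Exception reason);
  }

  // Before: a blocking call inside the event thread.
  //   Domain domain = Domains.get(database, host);   // latency landmine
  //
  // After: the lookup runs on a database executor and reports back.
  static void getDomain(Executor dbExecutor, String host, Callback<String> callback) {
    dbExecutor.execute(() -> {
      try {
        String domain = "domain-for:" + host; // stand-in for the real DB read
        callback.success(domain);
      } catch (Exception ex) {
        callback.failure(ex);
      }
    });
  }
}
```

The discipline is that the event thread only ever schedules the read and handles the callback; it never waits on the database.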

The gossip detector, though? I would build that again. It is simple, resilient, and gives me exactly the failure detection properties I need. The rendezvous hashing is solid too -- adding or removing hosts is smooth, and the failure modes are predictable.

The multi-region story is not finished. I need better monitoring, canary testing, and acceptance tests that run continuously. But the architecture is sound, and the path from Level 1 to Level 8 taught me where every tradeoff lives. I hope showing you the full progression -- including the dead ends and the messes -- helps you make better choices when you face the same scaling decisions.