Q1 2022
I like sleep. This simple fact drove every architectural decision as I built Adama towards a production launch in Q1. The question wasn't whether to ship -- it was whether I could ship something that wouldn't wake me up at 3 AM with a pager going off.
The quarter started by mapping out every possible deployment topology from "Ultra YOLO" (a single process, yearly data loss chance: 100%) through progressively more serious configurations up to a distributed load balancer with chain replication. Each topology trades cost for reliability in predictable ways. A $5/mo host with no database is cheap and fast but entirely unsurvivable. Adding MySQL via RDS buys you software update survival at the cost of 5% annual data loss. Adding a distributed load balancer with gossip-based failure detection finally gets you somewhere responsible.
I committed to the distributed load balancer model with gossip because gossip is so cool. The implementation uses an instance set chain -- a map of hashes to instance sets where capacity changes are tracked efficiently. Two peers exchange counters by first checking whether they share a common hash (the happy path: three messages). If they don't, they reverse the initiation and try again; if neither side has a common basis, they fall back to a slow full exchange. Instances become eligible for deletion after 20 seconds without a heartbeat and are considered dead at 25 seconds; that 5-second gap accounts for clock drift while keeping failure detection responsive.
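The liveness rules above can be sketched in a few lines. This is an illustrative reconstruction, not Adama's actual classes -- `InstanceTracker` and its method names are made up here, but the thresholds mirror the 20- and 25-second windows described:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Heartbeat-based failure detection with two thresholds: instances become
// deletion candidates after 20s of silence and are treated as dead at 25s.
public class InstanceTracker {
  private static final long DELETE_CANDIDATE_MS = 20_000;
  private static final long DEAD_MS = 25_000;

  // instance id -> wall-clock time of the last heartbeat seen via gossip
  private final Map<String, Long> lastHeartbeat = new HashMap<>();

  public void heartbeat(String id, long nowMs) {
    lastHeartbeat.put(id, nowMs);
  }

  public boolean isDead(String id, long nowMs) {
    Long seen = lastHeartbeat.get(id);
    return seen == null || nowMs - seen >= DEAD_MS;
  }

  // Remove entries that have been silent long enough to garbage collect.
  public int sweep(long nowMs) {
    int removed = 0;
    Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() >= DELETE_CANDIDATE_MS) {
        it.remove();
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    InstanceTracker tracker = new InstanceTracker();
    tracker.heartbeat("host-a", 0);
    tracker.heartbeat("host-b", 0);
    tracker.heartbeat("host-a", 10_000); // host-a keeps beating
    System.out.println(tracker.isDead("host-b", 26_000)); // true
    System.out.println(tracker.isDead("host-a", 26_000)); // false
    System.out.println(tracker.sweep(26_000)); // 1 (host-b removed)
  }
}
```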
The gossip protocol fed routing tables, and routing tables fed the web proxy tier. All of this was working in staging with seven nano hosts spread across three availability zones, a small RDS instance, and an ELB. My gossip was gossiping. My routing tables were routing. Deploy kind of worked.
I'm not smart at business. I'm an engineer's engineer building out of joy. The core module had 2,781 tests at 100% coverage. I set a launch requirement of 95%+ coverage for critical modules. By February, the test count had grown to 3,131 with Jenkins running 500+ successful test runs over two days. The saas module jumped from 29% to 96% coverage.
Meanwhile, I was thinking about why pub/sub sucks. Having spent over half a decade as a technical leader on a very large real-time distributed system, I've seen the fundamental problem: quadratic complexity. If N subscribers are also publishers, the network chokes on fan-out overhead. The traditional fix is a client/server model where the server ingests everything and vends compact aggregated updates. This is exactly what Adama does -- it inserts itself as a reducer, crushing N publishes into one data write, then forwarding that one write to N subscribers. Quadratic complexity drops to linear. The worst case for Adama is being used like any other pub/sub system; the best case is entire board games.
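The reducer shape can be sketched as follows. This is a toy illustration of the idea, not Adama's API -- `DocumentReducer` and its methods are hypothetical names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Instead of every publisher sending to every subscriber (N^2 messages),
// all publishes merge into one shared document, and the single resulting
// change fans out once per subscriber (linear in N).
public class DocumentReducer {
  private final Map<String, Object> document = new HashMap<>();
  private final List<Consumer<Map<String, Object>>> subscribers = new ArrayList<>();
  private int writes = 0;

  public void subscribe(Consumer<Map<String, Object>> subscriber) {
    subscribers.add(subscriber);
  }

  // Crush a batch of N publishes into one data write, then forward once.
  public void publishBatch(List<Map<String, Object>> publishes) {
    for (Map<String, Object> p : publishes) {
      document.putAll(p); // reduce: merge every publish into the document
    }
    writes++; // one durable write regardless of how many publishes arrived
    for (Consumer<Map<String, Object>> s : subscribers) {
      s.accept(document); // one compact update per subscriber
    }
  }

  public int writeCount() { return writes; }

  public static void main(String[] args) {
    DocumentReducer reducer = new DocumentReducer();
    int[] delivered = new int[1];
    for (int i = 0; i < 3; i++) {
      reducer.subscribe(doc -> delivered[0]++);
    }
    List<Map<String, Object>> batch = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      batch.add(Map.of("player-" + i, "moved"));
    }
    reducer.publishBatch(batch); // 10 publishes -> 1 write -> 3 deliveries
    System.out.println(reducer.writeCount() + " " + delivered[0]); // prints "1 3"
  }
}
```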
I also spent time thinking about streaming query languages and how to break data out of Adama's silo. The observation: everything is a document. JSON, a single variable, a giant SQL table, a query result -- all documents. The pragmatic path was tailers first, then document indexing, then foreign document embedding if someone showed up with a down payment. Starting with the query language first is a recipe for churn.
The open strategy crystallized around three prongs: infrastructure (open source, rock bottom pricing), an IDE (the meta problem of making web development not suck), and games (the actual business). Adama has one customer at the time of writing: me. Success means Adama spawns franchises across the world. If a big tech company copies it and offers it cheaper, great -- I can use it and shift focus.
Then I launched in early access and got punched in the face.
Load testing revealed that my MySQL-via-RDS storage was slow and my gRPC layer between web and Adama hosts was burning CPU like crazy. The client traffic barely registered on the CPU, but the internal protocol was a disaster. I benchmarked 1,600 player connections to 400 games with each player making a move every 10 seconds. Adama's raw cost ($0.2285/hr) was 78% cheaper than an equivalent AWS serverless stack ($1.049/hr) using Lambda, API Gateway, and DynamoDB. But the latency was not where I wanted it.
So I ripped out gRPC and replaced it with vanilla Netty and simple code-generated codecs. The results:
| concurrent streams | grpc p95 latency | new p95 latency |
|---|---|---|
| 400 | 65 ms | 54 ms |
| 800 | 75 ms | 67 ms |
| 1600 | 400 ms | 260 ms |
| 3200 | 2100 ms | 1200 ms |
CPU usage on the web proxy dropped 50%. Not great yet, but directionally correct.
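To give a feel for what "simple code-generated codecs" means relative to gRPC's generic marshaling, here is a sketch of a flat binary codec for one message type. The message shape, field names, and wire layout are invented for illustration, and I use `java.nio.ByteBuffer` here rather than Netty's `ByteBuf` to keep the sketch dependency-free:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// A hand-rolled (or generated) codec for one message type: a fixed header
// followed by a length-prefixed UTF-8 payload. No reflection, no generic
// marshaling -- just flat reads and writes.
public class MoveMessageCodec {
  // wire format: [int gameId][long seq][int len][utf8 payload]
  public static ByteBuffer encode(int gameId, long seq, String payload) {
    byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 4 + bytes.length);
    buf.putInt(gameId);
    buf.putLong(seq);
    buf.putInt(bytes.length);
    buf.put(bytes);
    buf.flip();
    return buf;
  }

  public record MoveMessage(int gameId, long seq, String payload) {}

  public static MoveMessage decode(ByteBuffer buf) {
    int gameId = buf.getInt();
    long seq = buf.getLong();
    byte[] bytes = new byte[buf.getInt()];
    buf.get(bytes);
    return new MoveMessage(gameId, seq, new String(bytes, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    ByteBuffer wire = encode(42, 7L, "{\"move\":\"e4\"}");
    MoveMessage msg = decode(wire);
    System.out.println(msg.gameId() + " " + msg.seq() + " " + msg.payload());
    // prints: 42 7 {"move":"e4"}
  }
}
```

In a Netty pipeline, this kind of codec slots in behind a length-field frame decoder, so each decode call sees exactly one complete message.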
The bigger problem was the data side. Compaction in MySQL was blocking the data service -- scanning 285,000 items when the document history hit certain thresholds. This required a fundamental rethink. I introduced "volatile data" as a concept for UI animation state that doesn't come from the server, started designing a new write-ahead log, and went deep on disk performance benchmarks.
The tyranny of small things: hardware works best on predictable batches. I measured everything. Opening and closing the file per append: slow. Holding the file open: 3 GB/sec, matching my NVMe drive. Java's scheduler ignores sub-millisecond requests, which is a problem for low-latency flushing. But Thread.sleep turned out to have real precision, so I built a custom precise scheduler around it.
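The scheduler idea can be sketched as below. This is an illustrative reconstruction, not Adama's actual scheduler: rather than trusting an executor's coarse timer, sleep in 1 ms slices until the deadline is nearly due, then spin the final stretch:

```java
// A precise one-shot scheduler: Thread.sleep(1) slices for the coarse wait,
// then Thread.onSpinWait() for sub-millisecond accuracy at the end.
public class PreciseScheduler {
  // Run the task as close to targetNanos (System.nanoTime basis) as possible.
  public static void runAt(long targetNanos, Runnable task) throws InterruptedException {
    while (true) {
      long remaining = targetNanos - System.nanoTime();
      if (remaining <= 0) {
        break;
      }
      if (remaining > 2_000_000L) {
        Thread.sleep(1); // coarse wait: 1 ms slices keep precision
      } else {
        Thread.onSpinWait(); // final stretch: spin briefly instead of sleeping
      }
    }
    task.run();
  }

  public static void main(String[] args) throws InterruptedException {
    long target = System.nanoTime() + 5_000_000L; // 5 ms from now
    runAt(target, () -> {
      long lateMicros = (System.nanoTime() - target) / 1_000;
      System.out.println("fired " + lateMicros + " us after the deadline");
    });
  }
}
```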
The first prototype of my custom storage (before correctness testing) hit 347 MB/sec with 1.8M writes/second -- over 10x RocksDB's 27 MB/sec. After adding failure mode handling, unit tests, metrics, and log rotation, it settled at 250+ MB/sec. I named it Caravan.
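The measured fast path -- one file handle held open for the life of the log, appends batched, one fsync per batch -- can be sketched like this. It's a toy illustration of the write-ahead-log idea, not Caravan itself:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// An append-only log that opens the FileChannel once and keeps it open;
// durability is deferred so one force() covers a whole batch of appends.
public class AppendLog implements AutoCloseable {
  private final FileChannel channel;

  public AppendLog(Path path) throws IOException {
    // opened once, held open: the difference between "slow" and NVMe speed
    this.channel = FileChannel.open(path,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
  }

  // length-prefixed record append; durability is deferred to flush()
  public void append(byte[] record) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
    buf.putInt(record.length).put(record).flip();
    while (buf.hasRemaining()) {
      channel.write(buf);
    }
  }

  // one fsync covers the whole batch of appends
  public void flush() throws IOException {
    channel.force(false);
  }

  @Override
  public void close() throws IOException {
    flush();
    channel.close();
  }

  public static void main(String[] args) throws IOException {
    Path path = Files.createTempFile("append", ".log");
    try (AppendLog log = new AppendLog(path)) {
      for (int i = 0; i < 1000; i++) {
        log.append(("record-" + i).getBytes());
      }
      log.flush(); // 1000 appends, a single force
    }
    System.out.println(Files.size(path)); // prints 13890
    Files.delete(path);
  }
}
```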
Integrating Caravan into Adama dropped p50 latency from 11ms to 2ms. Eliminating the RDS instance ($0.152/hr) cut hosting costs by 66.5%. With the new storage and networking, Adama could host 4,000 players playing 1,000 games at half the previous latency. Compared to AWS serverless for the same workload: $0.0765/hr vs $3.33/hr. That's 97.7% cheaper. Serverless may be a cruel joke.
I ended the quarter with an early access launch, a handful of actual users, a custom storage engine, a custom networking layer, a gossip failure detector, and the honest admission that I have much work ahead. The next step was to get Caravan deployed to production and start thinking about S3 archival for long-term durability.
For the next few months, I'll embrace randomization and forgive myself for being a conflicted soul.