Q1 2023
CDN Work and Fighting Burnout
I started the year inventing board game AI and ended it measuring HTTP latency with Apache Bench. Somewhere in between, I had a crisis about the sheer volume of things I was trying to build at once.
The AI work was fun. I was exploring how to build artificial players for board games, specifically ones that require deception -- like my beloved Battlestar Galactica board game. The platform already gives me the worst possible AI for free: a random bot that picks from a multiple-choice set of legal moves. The open question was how to go from "random" to "competent liar." I sketched out ideas for prediction functions and objective annotations in the Adama language, and then I looked up from my notebook and realized I was building a game engine, a web framework, a WYSIWYG editor, and a new type of cloud simultaneously.
Seriously, what was I thinking?
The burnout post in January was honest. I was in a delusional haze driven by ego, stretching myself across too many fronts. The great benefit of wandering is discovery, but the wicked problem is loneliness -- and loneliness compounds when you can't explain to anyone what you're even doing because there's too much to explain. I had to make cuts. Roslin went on the back burner. The WYSIWYG editor got shelved. RxHTML went to maintenance mode. The game engine ambition got replaced with Phaser.js because "building an online board game the hardest way possible" was only going to lead to despair.
Cutting scope made me feel sick. But it forced a better question: what is the actual business? The infrastructure. Web hosting. Real-time services. Things I have genuine credibility to offer, given my time at Amazon S3 and Meta.
With the strategy simplified, I turned to the CDN. The web hosting side of Adama existed in an "it works" state, which is a polite way of saying it was unoptimized garbage. My first benchmark run showed a p95 latency of 246ms for a simple HTML page. The breakdown was painful: 68ms of DNS, 43ms connecting, 47ms of TLS, and 107ms waiting. Being a 100% SSL service was expensive.
I peeled the onion layer by layer. First, certificate caching -- I replaced a never-invalidated cache with a time-expiring LRU. That didn't move the needle on performance, but it fixed a correctness problem. Then I looked at the Adama request pathway. A naive perma-cache hack saved about 30ms, but it was so incorrect I backed it out. The real discovery came when I found the web tier was creating a fresh TLS connection to S3 for every single asset request. CPU hit 100%. Every asset download required a full SSL handshake. Yikes.
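The certificate-cache fix can be sketched as an LRU whose entries also age out, so a rotated certificate eventually gets re-fetched instead of being served forever. This is an illustrative Python sketch, not the actual Adama implementation; the class name, capacity, and TTL are all assumptions:

```python
import time
from collections import OrderedDict


class ExpiringLRUCache:
    """LRU cache whose entries also expire after a fixed TTL.

    A hypothetical sketch of the fix: the old cache never invalidated
    entries; this one evicts on both capacity and age.
    """

    def __init__(self, capacity=256, ttl_seconds=3600, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = OrderedDict()  # key -> (value, inserted_at)

    def get(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        value, inserted_at = item
        if self.clock() - inserted_at > self.ttl:
            del self.entries[key]  # expired: force a re-fetch
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self.entries[key] = (value, self.clock())
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

The injectable `clock` is just there to make the sketch testable; the point is the dual eviction rule, which is what turns a correctness bug into a non-issue.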
Adding connection pooling (100% organic, no library) took the p95 from 330ms down to 145ms at full concurrency. Then I formalized the asset caching with a two-tier policy: memory for small HTML files, disk for medium files, skip large files. Assets are immutable after upload, so the caching is perfect -- no invalidation needed. That got us to 64ms p95 for memory-cached assets.
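The two-tier policy reduces to a size-based routing decision: small assets in memory, medium ones on disk, large ones streamed straight through. A minimal sketch, assuming thresholds the post doesn't actually state:

```python
from enum import Enum


class CacheTier(Enum):
    MEMORY = "memory"
    DISK = "disk"
    NONE = "none"


# Hypothetical cut-offs; the real values are a tuning decision.
MEMORY_LIMIT_BYTES = 64 * 1024       # small assets stay in RAM
DISK_LIMIT_BYTES = 8 * 1024 * 1024   # medium assets go to local disk


def pick_tier(size_bytes: int) -> CacheTier:
    """Route an immutable asset to a cache tier by size.

    Because assets never change after upload, there is no invalidation
    logic at all: an entry is valid for as long as we keep it.
    """
    if size_bytes <= MEMORY_LIMIT_BYTES:
        return CacheTier.MEMORY
    if size_bytes <= DISK_LIMIT_BYTES:
        return CacheTier.DISK
    return CacheTier.NONE  # large assets stream straight from S3
```

The immutability guarantee is what makes this "perfect" caching: the only policy questions are placement and eviction, never staleness.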
The numbers started looking competitive. On an a1.medium instance costing $0.0255/hour, with the full caching hack enabled, I was seeing 859 requests per second at 100 concurrent connections. The per-request cost worked out to approximately $3.6e-08, which might as well be zero. The NLB choice over ALB was paying off too -- 32x fewer LCUs consumed because we handle TCP directly and terminate SSL ourselves.
Then I started thinking about multi-region. A test with a PlanetScale database in Oregon and a machine in Kansas City showed 600ms p95 latency, which was expected. Re-introducing the caching hack dropped it to 6ms. The gap between cached and uncached made it clear: caching everything without talking to Adama is exceptionally hard. The real answer is document replication -- having Adama tail the origin's data stream so the web tier always talks to a local replica. This creates a regional replica model where the core trade-off is consistency, and the core difficulty is knowing how long to keep replicas alive.
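The replica model can be sketched as a state map that applies a sequenced change stream tailed from the origin, with reads served locally. Reads may lag the origin, which is exactly the consistency trade-off. Everything here -- the names, the change-tuple shape -- is illustrative, not Adama's actual replication protocol:

```python
class RegionalReplica:
    """Sketch of a regional replica that tails an origin's change stream.

    Reads are local (fast) but may lag the origin (the consistency
    trade-off). Deciding how long to keep a replica alive is the hard
    part, reduced here to "until someone stops calling apply()".
    """

    def __init__(self):
        self.state = {}
        self.seq = 0  # last change applied from the origin's stream

    def apply(self, change):
        """Apply one (seq, key, value) change tailed from the origin."""
        seq, key, value = change
        if seq <= self.seq:
            return  # replayed or out-of-order: already applied, skip
        self.state[key] = value
        self.seq = seq

    def read(self, key):
        """Local read; never touches the origin, so it may be stale."""
        return self.state.get(key)
```

The sequence-number guard makes stream replays idempotent, which matters because tailing a remote stream over a flaky WAN link means re-delivery is the norm, not the exception.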
I sketched the multi-region protocol: the web tier guesses which Adama host to talk to using rendezvous hashing, and Adama responds with either a local host or a redirect to another region. The protocol allows bounces. We detect cycles. We set hard limits on redirect depth. In the end game, the web tier never consults the database directly.
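The routing step can be sketched as rendezvous hashing plus a bounded redirect loop. Here `ask` is a hypothetical stand-in for the real web-tier-to-Adama RPC, and the depth limit is an assumed value:

```python
import hashlib


def rendezvous_pick(key: str, hosts: list) -> str:
    """Pick the host with the highest hash(key, host) score."""
    def score(host: str) -> int:
        digest = hashlib.sha256(f"{key}|{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(hosts, key=score)


def resolve(document_key, hosts, ask, max_bounces=4):
    """Follow Adama redirects until a host accepts, with loop protection.

    `ask(host, key)` stands in for the real RPC: it returns either
    ("local", host) when that host serves the document, or
    ("redirect", other_host) to bounce the request elsewhere.
    """
    host = rendezvous_pick(document_key, hosts)
    seen = set()
    for _ in range(max_bounces):
        if host in seen:
            raise RuntimeError(f"redirect cycle via {host}")
        seen.add(host)
        verdict, target = ask(host, document_key)
        if verdict == "local":
            return target
        host = target  # bounce and try again
    raise RuntimeError("redirect depth limit exceeded")
```

Rendezvous hashing gives every web-tier node the same first guess for a given document without any coordination, so the redirect path only fires when the guess is wrong -- and the `seen` set plus the hard bounce limit ensure a misconfigured mesh fails loudly instead of looping forever.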
The quarter ended with my work cut out for me, but I was closer to the real game rather than some delusional ego trip. The infrastructure is the business. Everything else is noise I have to tune out.