February 21st, 2023

Going hard on low latency, massive scale, with multiple regions

By Jeffrey M. Barber

Recently, I optimized the CDN for a single region, and the task of going to multiple regions is non-trivial. I’ve thought about this in the past since the platform is currently a mess with respect to multiple regions, and I’ve also thought about what massive scale would mean. The task at hand is to lower latency in a multi-region context at massive scale. Today, I’m writing up my thoughts as a ramble. Don’t expect much.

The multi-region mess has been well documented, and I recently experimented with a PlanetScale database located in Oregon paired with a machine in Kansas City. This new setup had a p95 latency of 600ms, which was both expected and unexpected. However, by re-introducing the caching hack, the latency dropped to 6ms. Now, it makes sense to revisit some prior assumptions.

The first hard truth is that caching anything without talking to the Adama service is exceptionally hard. We can try to formalize the caching protocol between the web tier and Adama, but this will be confusing and doesn’t solve cross-region latency in the cold case. Instead, the ideal is to replicate the document from the origin region to remote regions. We can do this by having Adama tail the origin’s data stream. This way, the web tier always talks to Adama within the remote region, and the core trade-off is consistency. Beyond replication, there is a key difficulty in deciding how long to keep the replica alive within a region. This is yet another policy variable as it directly translates to customer cost.
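To make that lifetime policy concrete, here is a minimal sketch in Java of a remote-region replica that tails the origin’s stream and expires itself after an idle window. The class, method names, and the shape of the stream are invented for illustration; this is not the actual implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a remote-region replica that tails the origin region's
// data stream and expires itself after a policy-defined idle window.
public class RegionalReplica {
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
  private final AtomicLong lastReadAt = new AtomicLong(System.currentTimeMillis());
  private final long keepAliveMillis; // policy variable: translates directly to customer cost

  public RegionalReplica(long keepAliveMillis) {
    this.keepAliveMillis = keepAliveMillis;
  }

  // called whenever a change arrives from the origin region's stream
  public void onOriginChange(String deltaJson) {
    // apply the delta to the local copy of the document (elided)
  }

  // called whenever a client in this region reads the document
  public void onLocalRead() {
    lastReadAt.set(System.currentTimeMillis());
  }

  // periodically decide whether the replica has earned its keep
  public void scheduleExpiry(Runnable shutdown) {
    timer.scheduleAtFixedRate(() -> {
      if (System.currentTimeMillis() - lastReadAt.get() > keepAliveMillis) {
        shutdown.run(); // stop tailing; the next cold read re-replicates from the origin
      }
    }, keepAliveMillis, keepAliveMillis, TimeUnit.MILLISECONDS);
  }
}
```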

The regional replica is the keystone for massive scalability, and the key trick is to figure out when to go beyond a single replica within a region. This is, at its core, a protocol issue between the web tier and the Adama fleet. First, for massive scale, we will require a new method which centers around reading a document at infinite scale. This new method will then opt in to tailing a read replica, and there is some relatively simple math for how to bring more replicas on board. This will assume capacity management keeps up with the demand (which is a big if).
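One plausible version of that “relatively simple math” is a ceiling division of observed readers by a per-replica capacity target, clamped by available capacity. The constants and names below are assumptions for illustration, not measured numbers.

```java
// Hypothetical sketch of the scaling math: grow the replica count with demand,
// assuming each read replica comfortably serves a fixed number of readers.
public class ReplicaMath {
  public static int desiredReplicas(int connectedReaders, int readersPerReplica, int maxReplicas) {
    int needed = (connectedReaders + readersPerReplica - 1) / readersPerReplica; // ceiling division
    return Math.max(1, Math.min(needed, maxReplicas)); // always at least one, capped by capacity
  }

  public static void main(String[] args) {
    // 25,000 readers at 5,000 readers per replica -> 5 replicas (if capacity management keeps up)
    System.out.println(desiredReplicas(25000, 5000, 64));
  }
}
```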

The above work will require tweaking the protocol between the web tier and Adama, and this is the opportunity to rethink latency across regions. Fortunately, the latency between the web tier and Adama is low, so we can simplify the client-to-Adama interaction. We remind ourselves that, in the current design, there is a single global master for an individual document. The source of truth is held within the global MySQL database. The multi-region design currently makes that lookup very expensive.

First, we identify the scenarios for where the data may be located.

| scenario name | requestor | location |
| --- | --- | --- |
| cold | any region | archive |
| warm | region=X | region=X |
| warm-remote | region=X | region=Y |

The core question now is what we can cache and where we cache it to minimize global lookups. The current code has the web tier checking with MySQL, but this is no good. As previously discussed, this should be a call to Adama within the region using optimistic placement rules. The web tier will guess, using a variety of heuristics, which Adama host to talk to, and Adama will respond with either a host within the region or another region along with a host. We need to add the plumbing such that the remote side can intercept the host to use, as this will minimize the need to re-guess which host to talk to.
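As a sketch of what that answer from Adama could look like, here is a hypothetical routing response with three shapes: “I am the host”, “talk to this other host in-region”, or “go to this region and start with this host”. The type and field names are invented; the real protocol may carry more.

```java
// Hypothetical sketch of the routing answer an Adama host could return to the
// web tier: either "I am the host", "talk to this other host in-region", or
// "go to this other region, and start with this host".
public class RoutingAnswer {
  public enum Kind { SELF, HOST_IN_REGION, OTHER_REGION }

  public final Kind kind;
  public final String region; // only set for OTHER_REGION
  public final String host;   // the host to talk to (or re-guess from)

  private RoutingAnswer(Kind kind, String region, String host) {
    this.kind = kind;
    this.region = region;
    this.host = host;
  }

  public static RoutingAnswer self(String host) {
    return new RoutingAnswer(Kind.SELF, null, host);
  }

  public static RoutingAnswer hostInRegion(String host) {
    return new RoutingAnswer(Kind.HOST_IN_REGION, null, host);
  }

  public static RoutingAnswer otherRegion(String region, String optimisticHost) {
    return new RoutingAnswer(Kind.OTHER_REGION, region, optimisticHost);
  }
}
```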

The heuristics that the web tier will use are a form of rendezvous hashing based on the accumulated mapping of spaces to Adama hosts (there is a gossip protocol between the web tier and Adama whereby the web tier has a complete mapping of all spaces to all hosts). An important idea at play is that the hashing should be accurate during stable times of peace, so this should minimize latency. However, we need to handle the really cold case where Adama hasn’t even loaded the space yet. Here, we need Adama to detect a space it hasn’t seen before and then download and deploy the code.
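Here is a minimal sketch of the rendezvous (highest-random-weight) guess the web tier could make over the gossiped host set; the hash function below is purely illustrative, and any stable, well-mixed hash would do.

```java
import java.util.List;

// Sketch of rendezvous hashing for the web tier's guess: given the gossiped set
// of Adama hosts serving a space, pick the host with the highest score for
// (space, key, host). Deterministic, so every web node guesses the same host.
public class RendezvousGuess {
  private static long mix(String s) {
    long h = 1125899906842597L; // simple mixing for illustration only
    for (int i = 0; i < s.length(); i++) {
      h = 31 * h + s.charAt(i);
    }
    h ^= (h >>> 33);
    h *= 0xff51afd7ed558ccdL;
    h ^= (h >>> 33);
    return h;
  }

  public static String pick(String space, String key, List<String> hosts) {
    String best = null;
    long bestScore = Long.MIN_VALUE;
    for (String host : hosts) {
      long score = mix(space + "/" + key + "@" + host);
      if (best == null || score > bestScore) {
        best = host;
        bestScore = score;
      }
    }
    return best;
  }
}
```

The appeal is stability: when the host set doesn’t change, every web node independently computes the same guess, so bounces should only happen around deploys, failures, or load shedding.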

We need to implement the findbind operation such that, with one MySQL transaction, we either look up the record OR bind the record to the requesting host. This operation can be cached within Adama for some time, and the Adama host needs to know whether it acquired the binding.
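A rough sketch of findbind as a single MySQL transaction, assuming a hypothetical `bindings` table keyed by space and key; the schema, table, and column names are invented, and the real operation may differ.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch of "findbind": within one MySQL transaction, either find
// the host already bound to the document or bind it to the requesting host.
public class FindBind {
  public static class Result {
    public final String host;
    public final boolean acquired; // true if the requesting host just became the master
    public Result(String host, boolean acquired) { this.host = host; this.acquired = acquired; }
  }

  public static Result findbind(Connection connection, String space, String key, String requestingHost) throws Exception {
    boolean previousAutoCommit = connection.getAutoCommit();
    connection.setAutoCommit(false);
    try {
      // lock the row (or the gap) so concurrent binds serialize
      try (PreparedStatement find = connection.prepareStatement(
          "SELECT `host` FROM `bindings` WHERE `space`=? AND `key`=? FOR UPDATE")) {
        find.setString(1, space);
        find.setString(2, key);
        try (ResultSet rs = find.executeQuery()) {
          if (rs.next()) {
            connection.commit();
            return new Result(rs.getString(1), false); // already bound (possibly to us)
          }
        }
      }
      // no binding: claim it for the requesting host
      try (PreparedStatement bind = connection.prepareStatement(
          "INSERT INTO `bindings` (`space`, `key`, `host`) VALUES (?, ?, ?)")) {
        bind.setString(1, space);
        bind.setString(2, key);
        bind.setString(3, requestingHost);
        bind.executeUpdate();
      }
      connection.commit();
      return new Result(requestingHost, true);
    } catch (Exception problem) {
      connection.rollback();
      throw problem;
    } finally {
      connection.setAutoCommit(previousAutoCommit);
    }
  }
}
```

The important bit is that the acquired flag comes out of the same transaction as the lookup, so two hosts can never both believe they won the bind.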

| scenario name | behavior |
| --- | --- |
| cold | the host is elected as the master and returns itself |
| warm | the host returns another host to talk to within the region |
| warm-remote | the host returns another region to talk to along with the optimistic host |
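Tying that table to the sketches above, the Adama-side decision could look roughly like this, reusing the hypothetical `RoutingAnswer` and `FindBind` types; again, this is illustrative only.

```java
// Hypothetical decision logic on an Adama host: given the findbind result and
// the region recorded alongside the binding, answer the web tier per scenario.
public class PlacementDecision {
  public static RoutingAnswer decide(FindBind.Result result, String myHost, String myRegion, String boundRegion) {
    if (result.acquired || result.host.equals(myHost)) {
      // cold: this host was just elected as the master, so it returns itself
      return RoutingAnswer.self(myHost);
    }
    if (boundRegion.equals(myRegion)) {
      // warm: point the web tier at the other host within this region
      return RoutingAnswer.hostInRegion(result.host);
    }
    // warm-remote: point the web tier at the other region plus the optimistic host
    return RoutingAnswer.otherRegion(boundRegion, result.host);
  }
}
```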

Generally speaking, caching is problematic. We consider two scenarios.

First, the source of truth transitions to archive; here, the staleness of the cache will guarantee the resurrection of that host as the master. This stickiness makes load shedding more complicated, so the protocol between web and Adama requires some degree of flexibility to adapt to changing circumstances. Fortunately, we control the protocol between web and Adama and can bounce around a few times to find the correct host.

Second, the source of truth transitions to archive and then to a different host. Here, we run into load shedding again. The key here is that a host, when talked to, needs the flexibility to point the web tier at a different host. The host that transitioned the document to archive can either forward directly or talk to the source of truth.

The key is that the protocol must, to some degree, allow redirects. The interesting aspect is measuring service health by the number of bounces that have occurred between web and Adama due to stale caching, and then setting a hard limit.

For the end game, the web tier will never consult the database; Adama is then responsible for talking to the database. An Adama host will know absolutely if it is the correct host, and it can cache pointers for the web tier to chase down. If those pointers are wrong, then the other Adama host will forward. We should detect cyclic references, and then either elevate the next request to invalidate a cache or fail the request and monitor it.
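A sketch of how the web tier could chase those pointers while bounding bounces and reacting to cycles; the `Router` interface and the cache-invalidation flag are assumptions layered on the hypothetical `RoutingAnswer` from earlier.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the web tier chasing routing pointers while guarding against stale
// caches: cap the number of bounces, detect cycles, and escalate the next
// attempt to a cache-invalidating request.
public class PointerChase {
  public interface Router {
    // ask the given host where the document lives; "invalidate" asks it to
    // refresh its cached pointer rather than trust it
    RoutingAnswer ask(String host, String space, String key, boolean invalidate);
  }

  public static String resolve(Router router, String firstGuess, String space, String key, int maxBounces) {
    Set<String> seen = new HashSet<>();
    String host = firstGuess;
    boolean invalidate = false;
    for (int bounce = 0; bounce < maxBounces; bounce++) {
      if (!seen.add(host)) {
        // cyclic reference: elevate the next request to invalidate caches
        invalidate = true;
        seen.clear();
        seen.add(host);
      }
      RoutingAnswer answer = router.ask(host, space, key, invalidate);
      if (answer.kind == RoutingAnswer.Kind.SELF) {
        return host; // found the master
      }
      host = answer.host; // follow the pointer (possibly into another region)
      // the bounce count doubles as a health signal: alarm if it trends upward
    }
    throw new IllegalStateException("too many bounces resolving " + space + "/" + key);
  }
}
```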

I have my work cut out for me. Once this bit of engineering is finished, I intend to focus on the macro game rather than just writing code.