June 5th, 2022 Thoughts on Capacity Management (and more) By Jeffrey M. Barber

This is an open design document and stream of consciousness flow session. As I’m trying to focus on getting to a minimally scalable version that is exceptionally low latency, I’m running with my new storage engine, called caravan, which I’m optimistic about. I’m working on getting it deployed to production, and in doing so I’ll have introduced a single point of failure. Obviously, this has problems, but it’s also a common problem for many environments running a database.

The hardest thing since leaving big tech is the lack of access to low-level teams which can offer a service shaped by my precise input and keep it alive, but I’m optimistic about a few vendors like redpanda. However, no matter how good redpanda is, it will not be comparable to what I’m doing with a local disk with regard to latency. I’ll leverage redpanda in a future implementation which would satisfy enterprise concerns. For now, I’ll focus on the productivity benefits of Adama and its exceptionally low cost, and just be pragmatic. The new design looks like this.

new design

The flow of this diagram is that a request from a person hits the frontend WebSocket tier, and that tier then needs to go to a database to learn where the document is. Either the document is on a host or it isn’t. If it is on a host, then great: just forward the connection. Otherwise, we pick a host and let that host bind to the database. The database needs to provide a transactional guarantee that only one host can win, and the losing hosts need to then reject traffic. An unfortunate aspect of this design is that there is extra latency when establishing a connection to a document, but that feels ok-ish. It’s worth considering whether the front-end needs to check at all, but let’s put a pin in that for now.
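
To make the “only one host can win” guarantee concrete, here is a minimal sketch of what the bind could look like against a relational directory. The table name, columns, and JDBC wiring are my working assumptions rather than a finished schema; the important bit is a unique key over (space, doc) so the race has exactly one winner.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

// Sketch: bind a document to a host. Only the first INSERT wins because the
// (assumed) directory table has a UNIQUE KEY over (space, doc).
public class DirectoryBind {
   /** returns true if this host won the bind, false if another host already owns the document */
   public static boolean tryBind(Connection connection, String space, String doc, String host) throws Exception {
      String sql = "INSERT INTO directory (space, doc, host) VALUES (?, ?, ?)";
      try (PreparedStatement statement = connection.prepareStatement(sql)) {
         statement.setString(1, space);
         statement.setString(2, doc);
         statement.setString(3, host);
         statement.execute();
         return true; // we won the race; serve the document
      } catch (SQLIntegrityConstraintViolationException duplicate) {
         return false; // someone else bound it first; reject traffic and let the frontend re-route
      }
   }
}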

There is now a static mapping of documents to hosts. This static mapping of document to host introduces the database as a failure point along with the host. I intend to switch to planetscale for my database provider with a single global region. Fortunately, this single region is only the control plane, and the data plane is on host. Since host failure is, like Thanos, inevitable, I’m backing up at the document level periodically while there are changes.

Backing up each document to something like S3 or Backblaze scopes my durability concerns to the recent few minutes of activity. Since backups don’t matter if restores don’t work, I’m making restores the main path. Once a document is closed, then it is backed up, unbound from the directory, and then deleted locally. This has some interesting consequences.
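
Before getting to those consequences, a quick sketch of that close path, mostly to pin the ordering down: back up first, then unbind, then delete locally, so a crash mid-sequence only ever leaves extra copies rather than zero. The BlobStore and Directory interfaces and the key layout are hypothetical stand-ins for S3/Backblaze and the database directory.

import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical close path: backup, then unbind, then delete the local copy.
public class DocumentCloser {
   public interface BlobStore { void upload(String key, byte[] bytes) throws Exception; }
   public interface Directory { void unbind(String space, String doc, String host) throws Exception; }

   public static void close(BlobStore store, Directory directory, String space, String doc, String host, Path localFile) throws Exception {
      byte[] snapshot = Files.readAllBytes(localFile); // 1. snapshot the local state
      store.upload(space + "/" + doc, snapshot);       // 2. back it up to the cloud
      directory.unbind(space, doc, host);              // 3. release the static mapping
      Files.deleteIfExists(localFile);                 // 4. only now reclaim local disk
   }
}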

First, documents can follow users between regions. If you fly from LA to New York, then your document will zip on over to the cloud and hydrate on the east coast. Second, the storage on a single host mirrors the memory capacity, so the dimensions of scaling are reduced from (cpu, memory, storage) to just (cpu, memory), which is fantastic. Third, if a host gets overloaded, then sacrifices can be made to push documents to the cloud to be hydrated elsewhere. Fourth, deployments can force host drains, which allows a great deal of agility when it comes to the disk layout and other aspects.

As a bonus, this design can carry forward, as the hard-core distributed systems work can augment these benefits by further reducing the scope of failures. The key issues to contend with in the future are the availability loss from host failure, elevated latency from host drains, and durability problems for writes that landed on disk during a host failure without a backup. This solution is pragmatic as I can focus on the higher levels, and I know people that I can hire to solve these problems.

The tricky problem which I’m thinking about as I lay down the current state of the nation is capacity management. Specifically, this design is prone to preserving hot spots since the mapping between document and host is fixed from restore until backup and close. Fortunately, a hot host can simply drain documents and fail updates, so this is a task on my list to contend with in the future if it becomes a problem. The pressing problem is host selection and detecting the need for new capacity, and I’ve sketched out that I can use the existing elastic host selection to pick a host. However, I never solved how elastic capacity management worked. So, here I am designing in this moment. Writing as I think this through.

Fortunately, I haven’t invested much in capacity management, so I’m free to explore without much baggage. What I do have is a mapping of spaces (i.e. the Adama java code driving document changes) to hosts, and this is used with the gossip mechanism to produce a routing table. The reactive mapping of documents to hosts is done via rendezvous hashing.
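
For reference, rendezvous hashing boils down to “score every host against the key and take the max,” which is also why every router computes the same answer independently. A minimal sketch (the MD5-based scoring is purely illustrative, not necessarily what Adama uses):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Minimal rendezvous (highest-random-weight) hashing: the winner for a key is
// the host with the highest score(host, key).
public class RendezvousRouter {
   public static long score(String host, String key) throws Exception {
      byte[] digest = MessageDigest.getInstance("MD5").digest((host + "/" + key).getBytes(StandardCharsets.UTF_8));
      long value = 0;
      for (int i = 0; i < 8; i++) {
         value = (value << 8) | (digest[i] & 0xFF);
      }
      return value;
   }

   public static String pick(List<String> hosts, String key) throws Exception {
      String winner = null;
      long best = Long.MIN_VALUE;
      for (String host : hosts) {
         long candidate = score(host, key);
         if (winner == null || candidate > best) {
            best = candidate;
            winner = host;
         }
      }
      return winner;
   }
}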

My original thinking was to have the overlord service (a service not on the diagram which collects stats, hosts Prometheus, and provides some operational insight and control) monitor which spaces were causing heat and then expand the hosts for that space. With the reactive routing, this would cause an immediate traffic shift.

With static mapping, the immediacy of the traffic shift would only apply to new connections, but an Adama host could detect this and start load shedding. The nature of adding and removing capacity interrelates to both load shedding and host selection.

For instance, Adama should not load shed unless there is new capacity to take over (unless the entire service is at risk). Otherwise, traffic may return to Adama and cause additional load. This begs so many questions. First, who decides when new capacity is added? Is it the frontend? Is it Adama? Is it the overlord?

Before I can answer these, I have to think through whether or not the mapping of spaces to hosts is a good idea. The reason for this map is to minimize the number of types of document within a single Adama host, as the JVM can only host so much code. Not only does this protect the JVM, but it allows me to think about noisy neighbors in the future since this service is multi-tenant with the goal of minimizing cost for myself and the customers. This feels like the right way, but capacity management then becomes the question of how to automate this mapping of spaces to hosts. The game I’m trying to play is (a) when do I add a host to the mapping for a particular space, and (b) when do I remove a mapping of a space from a host.

From having started to write the code, the front-end piece kind of makes sense. However, the directory is problematic: if you have a mild herd of requests for the same document within a few milliseconds (e.g. a group of players clicking the same link), then picking a random host until a winner is decided is a recipe for conflicts. I’m not a super fan of that, so maybe I can re-use the reactive routing to decide where an untethered document lands. This solves the herd problem without answering either (a) or (b).

Since cpu and memory utilization already flow from Adama to the web tier, this information could be used to decide if all the hosts are too hot and then bring new capacity on board. However, this doesn’t help with (b), and it may bring too much capacity to a space since the decision is decoupled from any central authority. This makes me feel like the front-end is not the right place to manage capacity beyond deploying new code to spaces.

So, it is either Adama or Overlord. Given the relationship between Adama and load shedding, it may make the most sense to implement this in Adama since Overlord is going to miss context. First, an Adama host can detect “Hey, I’m hot,” and here it can select a space and then bring more capacity in for that space. As a protection against overgrowth, bringing on more capacity could be velocity checked such that a host is only allowed to ask for more capacity for a given space once over some period of time.
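
A minimal sketch of that velocity check, assuming a simple per-space cooldown (the window length would be a tunable, not a real constant from the codebase):

import java.util.HashMap;
import java.util.Map;

// Per-space cooldown: a host may only ask for more capacity for a given space
// once per window.
public class CapacityVelocityCheck {
   private final long windowMilliseconds;
   private final Map<String, Long> lastRequestBySpace = new HashMap<>();

   public CapacityVelocityCheck(long windowMilliseconds) {
      this.windowMilliseconds = windowMilliseconds;
   }

   public synchronized boolean allow(String space, long nowMilliseconds) {
      Long last = lastRequestBySpace.get(space);
      if (last != null && nowMilliseconds - last < windowMilliseconds) {
         return false; // asked too recently; back off
      }
      lastRequestBySpace.put(space, nowMilliseconds);
      return true;
   }
}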

Overlord could perform a similar function, but an Adama host has the additional advantage that, upon making the decision to bring in a new host, it can then load shed precisely the documents which will go to the new host. This is possible because Adama knows both the new host and itself and can make the relative rendezvous hash decision. This settles in my mind that Adama is the right place to bring new hosts into the picture.
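
Concretely, the “which documents do I shed” predicate falls out of the same scoring: shed a key exactly when the new host outscores me. A sketch reusing the illustrative scoring from the rendezvous sketch above:

import java.util.function.Function;

// Shed exactly the documents the new host would win under rendezvous hashing;
// score(...) is the illustrative scoring from the RendezvousRouter sketch.
public class ShedPredicate {
   public static Function<String, Boolean> build(String self, String newHost) {
      return (String key) -> {
         try {
            return RendezvousRouter.score(newHost, key) > RendezvousRouter.score(self, key);
         } catch (Exception failure) {
            return false; // when in doubt, keep the document local
         }
      };
   }
}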

With an Adama host making the decision to deploy new capacity, this scopes the question to which space Adama should select to load shed. Each Adama host is responsible for a set of spaces, and the decision could be made randomly or based on some property. Dimensions to play with are: number of documents on the host, memory, cpu, and messaging traffic. I believe there are many adversarial conditions which make a fixed decision problematic.

For example, a very large document would cause tremendous capacity to come online in hopes of load shedding a single host. This requires flipping load shedding on its head and asking how to ensure a host can become dedicated to a single customer over time if they have some whales in their mix. It’s worth calling out that we are in the business of solving a knapsack kind of problem, which is NP-complete. This requires serious science.

It’s worth the effort to pause thinking about how to load shed and isolate a single customer and instead think about how to remove the mapping of a space from a host. For reasons similar to adding capacity, I’m inclined to have Adama perform this logic since it can also evict the documents locally. So, Adama wakes up, and the decision to evict a customer can be driven by how cold the space is: if there are few documents for a single space on a host and the memory and cpu are low enough, then evict. This has the problem that all capacity for a small customer could go poof, so we need some way to ensure a minimum capacity. Either the function that restricts capacity will reject the eviction or Overlord should enforce it. Given the ease of implementing a transaction, I think Adama should do it and treat failures as cause to back off on that space for some time period.
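
As a sketch, the “is this space cold enough to evict” decision plus the back-off on rejection could look like this; every threshold here is a made-up placeholder to be tuned later:

import java.util.HashMap;
import java.util.Map;

// Cold-space eviction decision with a back-off when the request is rejected
// (e.g. because the space would drop below its minimum capacity).
public class EvictionPolicy {
   private final Map<String, Long> backoffUntilBySpace = new HashMap<>();

   public synchronized boolean shouldEvict(String space, int documentsOnHost, long memoryBytes, double cpuUtilization, long now) {
      Long until = backoffUntilBySpace.get(space);
      if (until != null && now < until) {
         return false; // recently rejected; leave this space alone for a while
      }
      return documentsOnHost < 5 && memoryBytes < 64 * 1024 * 1024 && cpuUtilization < 0.05;
   }

   public synchronized void onEvictionRejected(String space, long now, long backoffMilliseconds) {
      backoffUntilBySpace.put(space, now + backoffMilliseconds);
   }
}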

So, we still need to figure out the precise mechanism to add capacity. We could pick randomly, and that would actually do OK. I wonder what the power of two choices could do. Suppose I pick two spaces that are available for consideration, and then what? Do I pick a random dimension? I feel like I should use document count, as that will spread large customers over the entire fleet and localize the impact of small customers. This is worth studying, but I need to just pick something and get the game started. So, let’s try to equalize the number of documents within a host per space. This requires a literature review, so let’s google some shit.
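
Before the literature dive, here is what the naive version would look like: sample two candidate spaces at random and shed the one with more documents on this host. The orientation (most documents wins) is my guess at the right policy rather than settled science.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Power of two choices over spaces: sample two candidates and shed the one
// with more documents on this host.
public class SpaceSelector {
   private final Random random = new Random();

   public String pickSpaceToShed(Map<String, Integer> documentsBySpace) {
      List<String> spaces = new ArrayList<>(documentsBySpace.keySet());
      if (spaces.isEmpty()) {
         return null;
      }
      String first = spaces.get(random.nextInt(spaces.size()));
      String second = spaces.get(random.nextInt(spaces.size()));
      return documentsBySpace.get(first) >= documentsBySpace.get(second) ? first : second;
   }
}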

The abstracts from that quick search give me faith, so let’s put a pin in this question and say there is some way to do this algorithmically. For the sake of progress, I can start to sketch out an interface for Adama to call.

interface CapacityManagement {
   /** Hey, we can continue the train of thought in comments. So, here, Adama has _somehow_ picked a space to loadshed from the host and we abstract out the call to work with the database. If the callback is successful, then we can loadshed documents. The String return type is the target to hash against.  */
   public void request(String space, Callback<String> callback);

   /** Here, the Adama host has decided that a space is not utilized well at all on the host, so it requests to unmap the host. We can decide if we load the documents now or after some period. */
   public void freeMe(String space, Callback<Void> callback);
}

This manages the spaces-to-hosts mapping, and a key missing piece is the operational velocity checking. A key situation to consider is a thundering herd of requests and how quickly we want capacity to come online for it. First, how often does a host signal for help? Second, how often does a host signal for help for a specific space? How often does a host free a space? Furthermore, is there a maximum number of hosts for a particular space?

The ideal world would converge to a state where a thundering beast of a customer is isolated to dedicated hardware while everyone else is appropriately multi-tenant. So, how we pick new capacity to load shed to requires consideration. However, we can get started by building an agent to monitor the host such that it gets woken up on some condition like 70% memory utilization or 80% cpu utilization. Then, every 10 minutes, it makes a decision to find a space to load shed. If that space has been shed in the last hour, then skip it. These numbers are arbitrary, but we can do the science later. This falls under the banner of “better than nothing”.
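
A sketch of that agent loop, hard-coding exactly the arbitrary numbers above; the metrics, picker, and requester interfaces are stand-ins, and the per-space cooldown reuses the velocity-check sketch from earlier:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// “Better than nothing” agent: every ten minutes, if the host is hot, pick a
// space to shed, skipping anything shed within the last hour.
public class CapacityAgent {
   public interface HostMetrics { double memoryUtilization(); double cpuUtilization(); }
   public interface SpacePicker { String pickSpaceToShed(); }
   public interface CapacityRequester { void requestMoreCapacity(String space); }

   private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
   private final CapacityVelocityCheck velocity = new CapacityVelocityCheck(60 * 60 * 1000L); // one hour per space

   public void start(HostMetrics metrics, SpacePicker picker, CapacityRequester requester) {
      executor.scheduleAtFixedRate(() -> {
         boolean hot = metrics.memoryUtilization() > 0.70 || metrics.cpuUtilization() > 0.80;
         if (!hot) {
            return;
         }
         String space = picker.pickSpaceToShed();
         if (space != null && velocity.allow(space, System.currentTimeMillis())) {
            requester.requestMoreCapacity(space);
         }
      }, 10, 10, TimeUnit.MINUTES);
   }
}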

We can do a bit of back-of-the-envelope analysis. If a hot herd comes thundering down the hill, then capacity will double in ten minutes. If that herd keeps thundering on, then that capacity will grow by 50% in the next ten minutes. This is because the new capacity will start to shed traffic, and this will continue until some maximum capacity is reached. This begs the question of when to wake the human (i.e. me) to manually add real capacity to the fleet. The moment I get woken up, I’d automatically start adding more capacity. I think the appropriate rule is: once the maximum capacity is 50% utilized, wake me up. This means I should watch the situation.

This also implies I’ll need tooling to control the situation. Maybe I need to raise the maximum? Maybe I need to manually migrate other documents? Maybe I want to isolate a customer? The operational questions feel important as they feed into host selection. In some sense, I need some signals to determine if a space is isolated or if a host is off limits. The hard science demands are countered by human meat in the loop.

The roadmap is taking form, so I’ll take a stab at building a TODO list. Huzzah for TODOs.

☐ Add the CapacityManagement interface to core
☐ Update CoreService and DocumentThreadBase to fail running documents based on a hash of the key (or just a Function<Key, Boolean> for ease)
☐ Update CoreService and DocumentThreadBase to keep a cached copy of the inventoryBySpace such that this can be used to understand what is on the host. Call this cache InventoryCache
☐ Have InventoryCache expose some functions like getSpaceByLeastMemory() or getSpaceByMostDocuments() (see the sketch after this list)
☐ Build the internal agent to watch Adama’s memory and cpu; this agent will drive behavior to evict or add capacity. Adding capacity will only happen when heat is on, but removing capacity will happen based on thresholds
☐ Implement CapacityManagement against mysql
☐ Assigning capacity will respect a maximum capacity per space with some override config, and respect operator preferences
☐ Overlord has a dashboard allowing operator to isolate spaces, manually assign capacity, or remove capacity
☐ Instead of failing capacity management calls, we need a filtered list of space candidates to consider. For example, velocity check on a space should limit the periodic candidates. Operator preferences should limit the spaces available.
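
To make the InventoryCache item a bit more concrete, here is the shape I have in mind; the fields are guesses at what inventoryBySpace actually carries:

import java.util.HashMap;
import java.util.Map;

// Periodically refreshed snapshot of per-space inventory on this host with
// simple selectors over it.
public class InventoryCache {
   public static class SpaceInventory {
      public final int documents;
      public final long memoryBytes;
      public SpaceInventory(int documents, long memoryBytes) {
         this.documents = documents;
         this.memoryBytes = memoryBytes;
      }
   }

   private volatile Map<String, SpaceInventory> snapshot = new HashMap<>();

   public void refresh(Map<String, SpaceInventory> latest) {
      snapshot = new HashMap<>(latest);
   }

   public String getSpaceByMostDocuments() {
      String best = null;
      int most = -1;
      for (Map.Entry<String, SpaceInventory> entry : snapshot.entrySet()) {
         if (entry.getValue().documents > most) {
            most = entry.getValue().documents;
            best = entry.getKey();
         }
      }
      return best;
   }

   public String getSpaceByLeastMemory() {
      String best = null;
      long least = Long.MAX_VALUE;
      for (Map.Entry<String, SpaceInventory> entry : snapshot.entrySet()) {
         if (entry.getValue().memoryBytes < least) {
            least = entry.getValue().memoryBytes;
            best = entry.getKey();
         }
      }
      return best;
   }
}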

Is this it? Maybe. Let’s see how important scenarios are handled.

First, the agent will wake up periodically to free low traffic spaces if the space has passed operator and velocity checks. The hard question is how fair the coverage map will be once everyone converges down to a minimum. Picking a random space to evict has no guarantees, but does feel like a decent starting point.

Second, the agent will detect heat and then pick a space, one which has passed operator and velocity checks, to add more capacity to. This will allow capacity to double in the first X minutes and then increase linearly by the original capacity every X minutes. After every Y minutes, the rate of capacity coming online steps up by another increment of the base (so it doubles at the first step). Suppose X is 10 minutes and Y is 30 minutes. This produces the capacity table below.

Minutes into a thundering herd    Capacity
 0                                 3
10                                 6
20                                 9
30                                12
40                                18
50                                24
60                                30
70                                39
80                                48
90                                57
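
Just to double-check the arithmetic, a tiny sketch that reproduces the table, under the assumption that each Y-minute window bumps the per-step growth by another multiple of the base capacity:

// Prints 0 -> 3, 10 -> 6, 20 -> 9, ..., 90 -> 57, matching the table above.
public class CapacitySchedule {
   public static void main(String[] args) {
      final int base = 3; // starting capacity (three hosts)
      final int x = 10;   // minutes between growth steps
      final int y = 30;   // minutes before the growth rate steps up
      int capacity = base;
      System.out.println("0 -> " + capacity);
      for (int minute = x; minute <= 90; minute += x) {
         int rate = base * (int) Math.ceil(minute / (double) y);
         capacity += rate;
         System.out.println(minute + " -> " + capacity);
      }
   }
}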

This feels OK to start, and it is tunable. The question now is what alarms should be in place to wake the operator when this is too slow.

The final scenario is managing the minimum capacity. Currently, the overlord tries to ensure that every space has three machines. The only missing aspect is having this logic obey operator limits.

This feels like a great way to get a fun game started! I’ll shred the above TODO list into how I track tasks, and then I’ll get to work.