March 27th, 2022
Writing to disk and the tyranny of small things
By Jeffrey M. Barber

Today, I’m writing about the process of writing to the disk and building a durable solution which can survive both a process death and a power outage. Well, actually, this wasn’t written today but over the course of a week of writing code; in that spirit, it’s really a journey of how I approached a problem. At its core, I’m building a custom embedded data-store-like thing for Adama.

I’ll also be looking into using RocksDB before I embark on writing a specialized solution for Adama. The goal is to solve the “single node” storage problem for Adama with a reasonable backup solution, and I’m optimistic that a reasonable solution can be found.

Actually, scratch that: this isn’t about being reasonable, this is about having fun squeezing pennies and making the computer go fast. The central problem is what could be called a “tyranny of small things”, where the hardware does its best work on large, predictable batches. In other words, the more you give the OS at once, the more it will get done. Let’s validate this to ensure we are always close to what the metal can provide.

First, let’s sanity check

It’s always helpful to know which patterns in code manifest in exceptional performance. So, I want to validate a few things that are obvious. First, we consider how bad it would be to just rely on the filesystem since that is the spiritual model we want.

open, append, flush, close

Now, it may be helpful to understand the units on the x-axis and y-axis. Each data point represents a run of data being written to a file with the x-axis being the total bytes written and the y-axis being the MB/second. There is a random distribution of writes of varying sizes which changes between runs. Unsurprisingly, excessively opening and closing a file to append will be slow. We can tweak this to hold the file open.
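To make the comparison concrete, here is roughly the shape of the two write paths being measured (a sketch of the idea, not the actual benchmark harness):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class WritePatterns {
  // the slow pattern: open, append, flush, close for every single write
  static void appendByReopening(File file, byte[] chunk) throws IOException {
    try (FileOutputStream output = new FileOutputStream(file, true)) {
      output.write(chunk);
      output.flush();
    } // the close happens here, every single time
  }

  // the faster pattern: hold the stream open and keep appending to it
  static void appendToHeldOpen(FileOutputStream output, byte[] chunk) throws IOException {
    output.write(chunk);
    output.flush(); // still flushing per write; batching the flush is the next experiment
  }
}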

hold open

There we go: we start to see the fantastic 3 GB/sec that my NVMe supports. It’s always good to know that there doesn’t appear to be much hidden between the code and the hardware. The above graph flushes after every write, so we then ask how effective it is to flush only at the end.

hold open and batch flush

Given the data sizes being written, it doesn’t appear to be super advantageous to hold off flushing once batch sizes get large. However, it’s worth putting a pin in this, as we need to zoom into the region of small writes. We can also see what happens when we wrap the FileOutputStream in a BufferedOutputStream.

buffered output stream

This is unintuitive. Given I’m writing just flat byte arrays which become large, this illustrates that using an additional buffer just does more work for no gain.

So, what have we learned? Well, obviously, writing larger chunks yields better performance. Problematically, this requires us to batch writes together ourselves.

Suppose we buffer a bunch of writes and then schedule a flush in one millisecond. Well, the scheduler within Java has something to say about this.

scheduler accuracy

The x-axis here is the requested milliseconds to schedule work, and the y-axis is the measured milliseconds. The scheduler will kindly ignore your quick request to flush a small batch and instead bias towards building a massive batch. This is exceptionally problematic if you intend to be low latency! Instead, we ask if this is a problem with the timer itself.
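The probe behind that graph is roughly this shape (a sketch using a ScheduledExecutorService; the original harness may have used a different timer and more samples per point):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerAccuracy {
  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    for (int requestedMs = 1; requestedMs <= 50; requestedMs++) {
      CountDownLatch latch = new CountDownLatch(1);
      long started = System.nanoTime();
      executor.schedule(latch::countDown, requestedMs, TimeUnit.MILLISECONDS);
      latch.await();
      double measuredMs = (System.nanoTime() - started) / 1_000_000.0;
      System.out.println("requested=" + requestedMs + "ms measured=" + measuredMs + "ms");
    }
    executor.shutdown();
  }
}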

Thread dot sleep

The good news is that you can sleep with precision! We can use this to build a low resolution scheduler. So, we write that and commit it. This feels like overkill, but we can then measure the precision of this scheduler.
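The committed scheduler has more machinery, but it reduces to a single thread that sleeps for a fixed tick and drains a queue of pending work; a minimal sketch (the class name and structure here are hypothetical):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class LowResolutionScheduler {
  private final ConcurrentLinkedQueue<Runnable> pending = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean alive = new AtomicBoolean(true);

  public LowResolutionScheduler(int tickMs) {
    Thread ticker = new Thread(() -> {
      while (alive.get()) {
        try {
          Thread.sleep(tickMs); // precise enough at millisecond resolution
        } catch (InterruptedException ie) {
          return;
        }
        Runnable task;
        while ((task = pending.poll()) != null) {
          task.run(); // run whatever accumulated during the tick
        }
      }
    });
    ticker.setDaemon(true);
    ticker.start();
  }

  // enqueue work to run on the next tick
  public void schedule(Runnable task) {
    pending.add(task);
  }

  public void shutdown() {
    alive.set(false);
  }
}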

woot

With predictable scheduling, we can then figure out ideal batch sizes to maximize throughput. At this point, let’s switch gears. The above work illustrates two concepts. First, if we batch within the application code, then we can achieve high performance. Second, we should take control over the timing for when throughput is low (and time becomes the flushing mechanism). These insights allow us to build a high-throughput write-ahead log.
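Concretely, the write path ends up shaped something like the sketch below: flush immediately once the batch is big enough, and let the low resolution scheduler trigger a flush on a timer for the quiet periods (the class name and threshold are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BatchingWriter {
  private final FileOutputStream output;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final int flushThresholdBytes;

  public BatchingWriter(FileOutputStream output, int flushThresholdBytes) {
    this.output = output;
    this.flushThresholdBytes = flushThresholdBytes;
  }

  // called by writers; a large enough batch triggers an immediate flush
  public synchronized void write(byte[] chunk) throws IOException {
    buffer.write(chunk);
    if (buffer.size() >= flushThresholdBytes) {
      flush();
    }
  }

  // called by the low resolution scheduler on every tick for the quiet periods
  public synchronized void flush() throws IOException {
    if (buffer.size() > 0) {
      output.write(buffer.toByteArray());
      output.flush();
      buffer.reset();
    }
  }
}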

The next part is how to index and shred the log into meaningful data structures. This requires care and is exceptionally challenging, so let’s compare some solutions at hand to get a sense of what is possible. First up, let’s look at RocksDB.

Looking into RocksDB

We can use RocksDB via its JNI bindings, and I trust RocksDB fairly well. So, we set up a benchmark to write 100,000 keys with values between 64 and 512 bytes. We then measure the time it takes RocksDB to execute, and the below table shows a sample of runs:

MB written MB/second elapsed ms writes/second
19.16 23.43 818 122249
19.11 29.26 653 153139
19.18 28.88 664 150602
19.08 29.4 649 154083
19.14 26.47 723 138313
19.13 27.36 699 143062
19.07 26.59 717 139470
19.18 27.05 709 141044
19.12 27.4 698 143266
19.08 27.97 682 146628

Here, we see RocksDB getting ~27 MB/second with a huge ~140K writes/second. This is encouraging, as I could just drop RocksDB into Adama and it would be a massive improvement over the current version. The key question is how reading and iterating would impact that performance. However, the 27 MB/second is bugging me to no end (and the JAR is 50 MB, which also irks me).
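For reference, the loop producing these numbers was roughly shaped like the sketch below, written against the RocksDB Java bindings; the key format, options, and path here are illustrative rather than the exact harness:

import java.util.Random;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksBench {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    Random rng = new Random();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/rocks-bench")) {
      long bytesWritten = 0;
      long started = System.currentTimeMillis();
      for (int k = 0; k < 100000; k++) {
        byte[] key = ("key-" + k).getBytes();
        byte[] value = new byte[64 + rng.nextInt(449)]; // values between 64 and 512 bytes
        rng.nextBytes(value);
        db.put(key, value);
        bytesWritten += value.length;
      }
      long ms = System.currentTimeMillis() - started;
      double mb = bytesWritten / (1024.0 * 1024.0);
      System.out.println(mb + " MB in " + ms + " ms (" + (100000.0 * 1000.0 / ms) + " writes/second)");
    }
  }
}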

If I were reasonable, then I’d probably stop here and move on with my life. Alas, here I am, about to build something specific. It’s worth noting that I considered H2, ObjectDB, and SQLite. I decided that I’d experiment with rolling my own first, since I doubt those solutions could compete with RocksDB.

It’s also worth noting that maybe I could achieve more throughput with RocksDB if I used multiple threads, but I’m biasing everything towards single threaded performance.

Design an Adama-specific embedded “database”

The nice aspect of designing from scratch is that you can tailor your solution specifically to particular usage patterns. For Adama, the key operation is “append a delta” to a log which happens after a get. Since Adama is effectively a write-ahead cache in spirit, gets are rare.

One of the big changes in flight right now is that we are biasing away from compacting the log using algebra and towards having Adama periodically snapshot the state. This snapshot saves a tremendous amount of compute, as we don’t need to unpack a stream of JSON deltas, do the algebra, and then merge them back.

We will bias towards a happy path where Adama snapshots periodically and when unloading a document. This will give us some flexibility for storage retrieval. Furthermore, taking a snapshot is the perfect opportunity to trim the log down as well. This lets us bias towards making writes exceptionally fast with potentially slow reads.

So, let’s design the ideal interface for which Adama could use:

// reading should be streaming focused
interface ByteArrayStream {
  // a new append was discovered
  void next(int appendIndex, byte[] value);

  // no more appends were found
  void finished();
}

// the logger
interface ByteArrayListDB {
  // reading will read the entire history of an object
  void get(
    long id,
    ByteArrayStream streamback);

  // writing appends a single change
  void append(
    long id,
    byte[] value,
    Callback<Void> callback);

  // we can trim the log by skipping ahead and freeing up memory
  void trim(
    long id, 
    int appendsToRemoveFromHead,
    Callback<Void> callback);

  // delete the entire object
  void delete(
    long id, 
    Callback<Void> callback);
}
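For a sense of how the read side gets used, here is a hypothetical caller-side sketch where loading a document means streaming every append back in order:

// replay a document on load by walking its appends in order (hypothetical caller code)
static void replay(ByteArrayListDB db, long documentId) {
  db.get(documentId, new ByteArrayStream() {
    @Override
    public void next(int appendIndex, byte[] value) {
      // each value is one delta; apply it to the in-memory document in order
      System.out.println("append " + appendIndex + " -> " + value.length + " bytes");
    }

    @Override
    public void finished() {
      // the document has now been fully reconstructed from its log
    }
  });
}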

A big consideration is that I’m also dropping the desire to store an infinite history. This will be a higher level concern where the consumer could either toss old data or archive into S3. At this level, the storage provider will do exactly what it is told. For the current market, archiving or infinite history are features with no customers. We only need enough history for reasonable product undo with enough redo history to deal with race conditions between conflicting Adama hosts.

We can use the write-ahead log as a durable way to transform memory, but we need the ability to flush the internal memory to disk so we can delete the log. The game we are playing therefore requires us to describe every change to internal memory via the log, and every data structure needs to be able to snapshot itself to the log. Since we are in purely experimental and playful territory, let’s get to action by building a “durable heap”.

Building a heap

We want to take a large chunk of memory and then be able to yield parts of it. The key idea of the heap is to take a big chunk of linear space and provision smaller portions to consumers.

interface Heap {
  // ask for a region of memory of the given size
  long ask(int size);

  // free up the given memory at the position and the size
  void free(long position, int size);
}

Now, the exceptionally difficult long-term problem of a heap is fragmentation, as there is no expectation of predictability in when allocations are freed. We must contend with fragmentation or the heap will grow without bound, which is problematic for a finite space. The open question is whether or not Adama’s built-in snapshotting will keep the heap finite. An important metric would be to measure fragmentation and then send backpressure to Adama to snapshot more often. This is an area of study, but let’s first build the data model for the heap.

Done; check out the simple heap!

The heap is exceptionally simple in spirit as it just uses a doubly linked list. It’s designed to bias towards head compaction. Its performance will degrade with exceptional fragmentation, but we start simple and measure later. The finite length is exceptionally important as well, since memory-mapped files use integer offsets which bound their size to 2 GB. Going beyond 2 GB is a matter of indirection and math, so we will do that at a later date.
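Boiled down, the linked version looks something like this sketch (the committed heap has more to it, and returning -1 when nothing fits is an assumption of the sketch): a doubly linked list of free regions scanned from the head, so holes near the head get reused first:

public class SimpleLinkedHeap implements Heap {
  // a node describing a free region [position, position + size)
  private static class FreeNode {
    long position;
    int size;
    FreeNode prev;
    FreeNode next;

    FreeNode(long position, int size) {
      this.position = position;
      this.size = size;
    }
  }

  private FreeNode head;

  public SimpleLinkedHeap(int totalSize) {
    head = new FreeNode(0, totalSize);
  }

  @Override
  public long ask(int size) {
    // first-fit scan from the head so head-side holes get reused first
    FreeNode current = head;
    while (current != null) {
      if (current.size >= size) {
        long result = current.position;
        current.position += size;
        current.size -= size;
        if (current.size == 0) {
          unlink(current);
        }
        return result;
      }
      current = current.next;
    }
    return -1; // no space available (sketch-level convention)
  }

  @Override
  public void free(long position, int size) {
    // insert a free node in position order, then merge with adjacent neighbors
    FreeNode node = new FreeNode(position, size);
    if (head == null || position < head.position) {
      node.next = head;
      if (head != null) head.prev = node;
      head = node;
    } else {
      FreeNode current = head;
      while (current.next != null && current.next.position < position) {
        current = current.next;
      }
      node.next = current.next;
      node.prev = current;
      if (current.next != null) current.next.prev = node;
      current.next = node;
    }
    mergeWithNext(node);
    if (node.prev != null) mergeWithNext(node.prev);
  }

  // absorb the next node if it is directly adjacent
  private void mergeWithNext(FreeNode node) {
    if (node.next != null && node.position + node.size == node.next.position) {
      node.size += node.next.size;
      unlink(node.next);
    }
  }

  private void unlink(FreeNode node) {
    if (node.prev != null) node.prev.next = node.next; else head = node.next;
    if (node.next != null) node.next.prev = node.prev;
  }
}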

Now that we can take a large chunk of disk and provision it, we need an index mapping ids to lists and lists to regions of memory. Fortunately, this is exceptionally simple to model out:

Done; check out a very simple index!
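The shape of it is a map from object id to an ordered list of heap regions; the class and field names below are hypothetical:

import java.util.ArrayList;
import java.util.HashMap;

public class SimpleIndex {
  // a region handed out by the heap
  public static class Region {
    public final long position;
    public final int size;

    public Region(long position, int size) {
      this.position = position;
      this.size = size;
    }
  }

  private final HashMap<Long, ArrayList<Region>> index = new HashMap<>();

  // record a new append for the given id
  public void append(long id, Region region) {
    index.computeIfAbsent(id, key -> new ArrayList<>()).add(region);
  }

  // the regions (in append order) used to stream a document back
  public ArrayList<Region> get(long id) {
    return index.getOrDefault(id, new ArrayList<>());
  }

  // drop the first 'count' appends (trim); the caller frees the returned regions
  public ArrayList<Region> trim(long id, int count) {
    ArrayList<Region> regions = index.get(id);
    ArrayList<Region> freed = new ArrayList<>();
    while (regions != null && count > 0 && !regions.isEmpty()) {
      freed.add(regions.remove(0));
      count--;
    }
    return freed;
  }

  // delete the entire object; the caller frees the returned regions
  public ArrayList<Region> delete(long id) {
    ArrayList<Region> regions = index.remove(id);
    return regions != null ? regions : new ArrayList<>();
  }
}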

At this point, we need to write to both the log and the disk. This dual writing has the problem of halving our write capacity. However, durability is worth it, so we build a simple write-ahead log which dumps operations to disk back to back, and then we apply each operation to the large file. The big question now is whether or not this is durable.

The failure model for the write-ahead log is that it will be truncated on power loss, so as long as we only acknowledge writes once a flush has happened, we just have to manage a rotating log. Since this write-ahead log is going to manipulate a memory-mapped file, we have to make sure the arrangement is sane: the memory-mapped file may be ahead of the log, so all operations within the log need to be idempotent. The rotation dance, and when we delete the log, is the most important consideration to get right. The nice property of using a heap is that the allocations are deterministic when ordered, and replaying the log will reconstruct the memory perfectly.
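A sketch of that acknowledgement rule (the framing and names here are illustrative, not the committed format): frame each operation with its length, write it to the log, and only run the caller’s callback after the flush has landed; the length prefix lets replay detect a truncated tail after power loss:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;

public class LogWriter {
  private final FileOutputStream output;
  private final ArrayList<Runnable> pendingAcks = new ArrayList<>();

  public LogWriter(FileOutputStream output) {
    this.output = output;
  }

  // append one operation to the log; the ack is deferred until the next flush
  public void append(byte[] operation, Runnable ack) throws IOException {
    ByteBuffer frame = ByteBuffer.allocate(4 + operation.length);
    frame.putInt(operation.length); // length prefix so replay can detect a truncated tail
    frame.put(operation);
    output.write(frame.array());
    pendingAcks.add(ack);
  }

  // flush (driven by the batching logic), then acknowledge everything that landed
  public void flush() throws IOException {
    output.flush();
    output.getChannel().force(false); // the bytes are durable before anyone is told
    for (Runnable ack : pendingAcks) {
      ack.run();
    }
    pendingAcks.clear();
  }
}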

The heap is an interesting data-structure to play with, so let’s combine all this together and create a rough benchmark.

Tick. Tock. Woah!

Here, we have a prototype that hasn’t been tested for correctness; it just combines all the bits together with no structure to the appends. This isn’t the final product, but it’s enough to get a sense of directional feedback.

MB written MB/second elapsed ms writes/second
19.02 161.2 118 847458
19.13 347.87 55 1818182
19.13 375.05 51 1960784
19.17 368.66 52 1923077
19.23 409.09 47 2127660
19.14 285.73 67 1492537
19.12 398.28 48 2083333
19.18 408.14 47 2127660
19.17 304.27 63 1587302
19.16 281.72 68 1470588

This is over 10X the performance of RocksDB on my first go, so that’s pretty exciting. I’m not sure if it is correct (and it isn’t doing everything that I need it to), but now I grind on writing unit tests and building it out. Once I validate that the functionality is as expected, the hard thing to test will be the durability claims. I am debating whether or not to build a rig which I can open up and pull the SSD out of, to validate that data which gets acknowledged is in fact durable. The easy test is to build a little environment which kill -9s the process to validate the process-death concerns. Doing this test several times an hour in the cloud should be interesting.

Work.

Work.

Work.

Accounting for failure modes, various unit tests, adding metrics, log rotation, and snapshotting of organizational metadata takes a toll, but the current candidate is pretty decent. It’s getting roughly 250+ MB/second. These values depend on the configuration, but this run forces at least one log cut-over and several flushes. The log cut-over is exceptionally expensive due to moving the file. I can tune the parameters such that 350 MB/sec is achievable, but that starts to feel like benchmark shenanigans.

MB written MB/second elapsed ms writes/second
19.16 147.4 130 769231
19.23 274.64 70 1428571
19.1 276.83 69 1449275
19.17 262.54 73 1369863
19.15 299.15 64 1562500
19.18 290.59 66 1515152
19.12 224.99 85 1176471
19.1 203.21 94 1063830
19.19 236.96 81 1234568
19.17 270.05 71 1408451

That’s great!

This new “DurableListStore” feels pretty good so far, and the next step is to use it within Adama.

Use the new building block to build the Caravan

I’m naming the new data service “Caravan” because it sounds neat. Integrating with Adama started out great. The prior solution, which would snapshot to the filesystem, was less bad than I thought, so let’s provide some numbers for the prior write-ahead log and compare them to the new one. Here, we see a median latency of 11ms, which is “ok-ish”:

Deltas Sent Send Ack Send Fail Stream Fail p5 Latency p50 Latency p98 Latency Errors
0 0 0 0 0 -1 -1 -1  
0 0 0 0 0 -1 -1 -1  
18755 16237 0 16118 0 1 11 27  
18995 16069 0 16313 0 1 11 32  
18602 16060 0 15993 0 1 11 68  
19045 16296 0 16364 0 1 12 32  
18978 16234 0 16310 0 1 11 30  
18578 16230 0 15971 0 1 10 30  
19204 16248 0 16502 0 2 12 33  
19031 16334 0 16348 0 1 11 30  
18917 16298 0 16260 0 1 13 32  
18837 16154 0 16180 0 1 11 31  
18955 16232 0 16281 0 1 12 29  
18515 16074 0 15915 0 1 11 36  

Beyond the modest p98, the worst thing about the above solution was the inability to process the first two seconds of load due to the thundering herd on the filesystem. Clearly, this solution was a stepping stone towards the current memory-mapped solution with a durable heap. We run the same benchmark load against the new Caravan data service:

Deltas Sent Send Ack Send Fail Stream Fail p5 Latency p50 Latency p98 Latency Errors
23025 5088 0 4824 0 1 2 81  
19064 16381 0 16318 0 1 2 23  
19047 16249 0 16300 0 1 2 16  
18852 16207 0 16131 0 1 2 14  
19168 16327 0 16395 0 1 2 14  
18794 16148 0 16081 0 1 2 14  
18902 16096 0 16178 0 1 2 14  
18841 16095 0 16122 0 1 2 14  
18905 16242 0 16181 0 1 2 14  
19014 16241 0 16274 0 1 2 14  
18876 16193 0 16159 0 1 2 15  
18875 16182 0 16154 0 1 2 27  
18828 16121 0 16117 0 1 2 19  
19109 16299 0 16351 0 1 2 15  
19051 16352 0 16310 0 1 2 15  
19013 16277 0 16271 0 1 2 15  
19106 16314 0 16342 0 1 2 15  

Not only is the p50 in single-digit milliseconds, but the p98 is cut down by over half. This solution would eventually become problematic due to heap fragmentation and the performance of the doubly linked list. Who could have imagined that using a linked list would be a problem… I replaced the linear heap with an indexed heap which uses three maps. This solution is simpler, scales much better, and provides relief from fragmentation by indexing free space (small allocations find small holes before creating new holes).
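The real replacement uses three maps; a two-map sketch captures the core idea of indexing free regions by position (for merging neighbors) and by size (so a small ask lands in the smallest hole that fits):

import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexedHeapSketch {
  private final TreeMap<Long, Integer> freeByPosition = new TreeMap<>();      // position -> size
  private final TreeMap<Integer, TreeSet<Long>> freeBySize = new TreeMap<>(); // size -> positions

  public IndexedHeapSketch(int totalSize) {
    addFree(0, totalSize);
  }

  // find the smallest free region that fits; carve off what we need
  public long ask(int size) {
    Map.Entry<Integer, TreeSet<Long>> fit = freeBySize.ceilingEntry(size);
    if (fit == null) {
      return -1; // no hole is large enough
    }
    int regionSize = fit.getKey();
    long position = fit.getValue().first();
    removeFree(position, regionSize);
    if (regionSize > size) {
      addFree(position + size, regionSize - size); // give the leftover tail back
    }
    return position;
  }

  // free a region, merging with adjacent free neighbors to fight fragmentation
  public void free(long position, int size) {
    Map.Entry<Long, Integer> right = freeByPosition.ceilingEntry(position);
    if (right != null && right.getKey() == position + size) {
      removeFree(right.getKey(), right.getValue());
      size += right.getValue();
    }
    Map.Entry<Long, Integer> left = freeByPosition.floorEntry(position);
    if (left != null && left.getKey() + left.getValue() == position) {
      removeFree(left.getKey(), left.getValue());
      size += left.getValue();
      position = left.getKey();
    }
    addFree(position, size);
  }

  private void addFree(long position, int size) {
    freeByPosition.put(position, size);
    freeBySize.computeIfAbsent(size, key -> new TreeSet<>()).add(position);
  }

  private void removeFree(long position, int size) {
    freeByPosition.remove(position);
    TreeSet<Long> positions = freeBySize.get(size);
    positions.remove(position);
    if (positions.isEmpty()) {
      freeBySize.remove(size);
    }
  }
}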

With this as the new data store, the remaining work is to keep testing the durability claims during process crashes and power loss. The next step beyond that is to achieve long-term durability in the event of a complete host failure.

The current plan is to archive each document to S3 every 30 minutes after a change. This will have to be good enough until I invest in either a raft-based consensus model or plain old quorum-based replication (and since all appends have a sequencer, I can reconstruct the log). I’m not sure which direction I want to go, but I do foresee needing leader election as a core primitive.

One thing that is worth considering is migrating documents between different storage systems. For example, the “in-memory data service” is blazing fast, and I could imagine some scenarios wanting to leverage it. The question is how a space should be configured to utilize different data stores. Having some kind of configuration to balance between latency, cost, and durability feels very important, as it is exceptionally hard to get a grand unified solution. The key challenge isn’t providing the API but handling the change when data exists in different models. I can punt most of the future work around doing the distributed system right from a consensus point of view if I get migrations sorted out.

So, anyway, this rambling post has left me needing to shift priorities, get this new thing integrated into Adama, and bias away from RDS. Once I’m satisfied with this body of work, I can move forward with gusto on the IDE aspect and the UI.