March 8th, 2022

So, I replaced gRPC with netty and discovered more about latency

By Jeffrey M. Barber

As I’ve been mentally tortured by poor performance (and my inability to prioritize), I replaced gRPC with vanilla netty and very simple code-generated codecs. Today, that effort went to “production” and the results are…

Good!

| concurrent streams | gRPC p95 latency | new p95 latency |
|--------------------|------------------|-----------------|
| 400                | 65 ms            | 54 ms           |
| 800                | 75 ms            | 67 ms           |
| 1600               | 400 ms           | 260 ms          |
| 3200               | 2100 ms          | 1200 ms         |

Obviously, this is still not great. However, I’ve cut out a major source of CPU usage on the web proxy; when profiling locally with the new canary tool, networking CPU dropped by 50% as well. That same canary tool, built to compare gRPC against vanilla netty, also revealed the major issue with the data service design.
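The generated codecs themselves aren’t shown in this post, but the core idea behind dropping gRPC for raw netty is plain length-prefixed framing. A minimal sketch of that framing (in Python for brevity; the real system is netty/Java, and the `encode_frame`/`decode_frames` names are mine, not from the project):

```python
import struct

def encode_frame(payload: bytes) -> bytes:
    # Prefix each message with a 4-byte big-endian length -- the same
    # framing netty's LengthFieldPrepender / LengthFieldBasedFrameDecoder
    # pair implements on the wire.
    return struct.pack(">I", len(payload)) + payload

def decode_frames(buffer: bytes):
    # Pull every complete frame out of the buffer; return the leftover
    # bytes so the caller can prepend them to the next socket read.
    frames, offset = [], 0
    while offset + 4 <= len(buffer):
        (length,) = struct.unpack_from(">I", buffer, offset)
        if offset + 4 + length > len(buffer):
            break  # partial frame; wait for more bytes
        frames.append(buffer[offset + 4 : offset + 4 + length])
        offset += 4 + length
    return frames, buffer[offset:]
```

Compared to gRPC, this skips HTTP/2 stream management entirely, which is one plausible source of the CPU savings.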

Namely, compaction is expensive. If a document’s history is 100,000 items, it doesn’t make sense to replay 100,000 items from the log to rebuild it. This is why compaction came into being as a mechanism to balance history with recall. The current implementation is problematic because compaction only kicks in after 15,000 items accumulate, so compacting the data requires 19 passes, which means scanning 19 × 15,000 = 285,000 items. Asymptotically, waiting until the history grows 50% past its maximum before compacting yields 3x read amplification (1.5 work / 0.5 change).
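The numbers above follow from straightforward arithmetic. A sketch (the 15,000 threshold and the 50%-slack figure come from this post; the function names are hypothetical):

```python
def reads_to_compact(total_items: int, threshold: int = 15_000):
    # Compaction fires once per `threshold` items, and each pass
    # scans a full threshold's worth of log entries.
    passes = total_items // threshold
    return passes, passes * threshold

def read_amplification(slack: float) -> float:
    # Waiting until history grows `slack` (as a fraction of max) past
    # its target means each compaction scans (1 + slack) of work to
    # absorb only `slack` of change.
    return (1 + slack) / slack
```

With the post’s numbers: `reads_to_compact(285_000)` gives 19 passes scanning 285,000 items, and `read_amplification(0.5)` gives the 3x figure (1.5 / 0.5).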

Given that those 15,000 reads block the data service, latency spikes at compaction time. If the service is humming along just fine, compaction comes in, staggers everything, and the delay cascades. Clearly, that’s not great. It’s especially not great to put within a transaction.

This requires a fundamental rethink of the role of compaction and how to achieve better latency. This is going to be a modest change for everything except the MySQL implementation, which will require a schema change. For now, I’ll do some cheesy hacks to make this work smoothly, as I intend to migrate to a very simple write-ahead logger.
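The write-ahead logger isn’t specified yet, but the essential shape is small: append-only, fsync before acknowledging, replay on recovery. A minimal sketch under those assumptions (every name here is hypothetical, not the project’s actual design):

```python
import os
import struct

class SimpleWAL:
    def __init__(self, path: str):
        self.path = path

    def append(self, record: bytes) -> None:
        # Append a length-prefixed record and force it to disk before
        # acknowledging, so a crash never loses an acknowledged write.
        with open(self.path, "ab") as f:
            f.write(struct.pack(">I", len(record)) + record)
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild state by reading every record back in append order.
        records = []
        if not os.path.exists(self.path):
            return records
        with open(self.path, "rb") as f:
            data = f.read()
        offset = 0
        while offset + 4 <= len(data):
            (length,) = struct.unpack_from(">I", data, offset)
            records.append(data[offset + 4 : offset + 4 + length])
            offset += 4 + length
        return records
```

The appeal over the current scheme is that writes are sequential and cheap, and compaction becomes a background concern (rewrite the log, swap it in) rather than something that blocks the hot path.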

Fun times!


As an update on durability, my thinking about S3 is shifting as well. Using the table format from launch, it’s clear that I have to be careful before I jump from a cheap place into a super expensive one.

| Service | Rate                     | Units Used                      | Total |
|---------|--------------------------|---------------------------------|-------|
| AWS S3  | $0.005 per thousand PUTs | 576,000 plays = 576 kPUTs       | $2.88 |

This illustrates an important question of WHAT I put into S3. Clearly, I can’t put every delta in, as that would be exceptionally expensive! What if I archived each document every hour?

| Service | Rate                     | Units Used                      | Total     |
|---------|--------------------------|---------------------------------|-----------|
| AWS S3  | $0.005 per thousand PUTs | 1,600 games/hr = 1.6 kPUTs/hr   | $0.008/hr |

That’s much better, but now durability is at risk for anything written in the hour between archives. I have much to think about.