February 10th, 2023 Optimizing the CDN aspect By Jeffrey M. Barber

Since my plan is to build a game with Phaser.js rather than build yet another game engine, I need a web server. Fortunately, Adama is a server-less web hosting platform (yay?). However, I haven’t invested much time beyond “it works”. I ended January by writing a CLI uploader which makes it easy to publish a website from the command line. At this point, I could start migrating my various web properties away from surge.sh. However, I immediately discovered that my unoptimized solution is … unoptimized. At its worst, the p95 latency was an insane 330 ms! In this post, I’m going to share the journey of fixing issues and optimizing web requests so Adama behaves much better as a CDN.

Taste it

The first task in optimizing anything is to simply measure. Here, I’m using Apache Bench (ab) since I am serving normal HTTP requests to the browser, and this was when I learned despair.

ab -n 100 https://www-jeffrey-io.adama.games/index.html

which yielded a p95 latency of 246 ms. I fired it up in Chrome to see the network breakdown, and it was roughly:

Task             Time
DNS Resolution   68 ms
Connecting       43 ms
TLS Setup        47 ms
Waiting          107 ms

Being a 100% SSL service is costing me greatly by adding roughly 90 ms of additional latency. However, let’s get a baseline that assumes we keep the connection open and talk to just the webserver. We do this by enabling keep-alive, and here we also throw more load against a hard-coded web asset. Since libadama.js is a hard-coded path within the webserver, this measures the webserver and my code path alone, without any content resolution. This should provide a north star.

ab -k -c 10 -n 1000 https://aws-us-east-2.adama-platform.com/libadama.js

yielded a p95 of 36 ms. Using ab with keep-alive (after fixing a few bugs in the web server, whoops), we can see the opportunity of optimizing beyond TLS. Now, we return to the user content, which requires a bunch of work to resolve, and run ab again.

ab -k -c 10 -n 1000 https://www-jeffrey-io.adama.games/index.html

yielded a p95 time of 330 ms. This is greatly elevated over the initial run due to the added concurrency. However, it tells me there is around 300 ms of opportunity just waiting, which will affect both the quality and the scale of experiences on the HTTP side. We can ask whether this is OK to launch, so let’s look at my alternative, which currently powers this site.

ab -c 10 -n 1000 https://www.adama-platform.com/index.html

yielded a p95 time of 245 ms. Interestingly, keep-alives didn’t work against surge, so I feel less bad about the bugs I fixed. From a competitive stance, success simply requires shaving about 150 ms off what I have now, and I believe I can do that with gusto within the benchmark context.

Note, the benchmark can be cheated via many caches, but such caches are required when dealing with thundering herds (like an HN hug of death). We will talk about the various trade-offs and potential features to offset the negatives.

Peel the onion: caching certificates

The user makes a request and it lands on a web-tier box. The web tier creates an Initializer which installs an SniHandler to look up the domain against a CertificateFinder. The production certificate finder looks up the domain in MySQL if it is not found within the local cache. While this was cached, it was never invalidated. I replaced it with an LRU cache which also expires entries based on time. This didn’t affect performance, but it improves the service.
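As a rough sketch of the shape of that cache (the class name, the SSLContext stand-in, and the size/age limits here are illustrative guesses, not the production CertificateFinder code), a size-capped LinkedHashMap in access order plus a per-entry timestamp covers both the LRU behavior and the time-based expiry:

import javax.net.ssl.SSLContext;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a size-capped LRU cache whose entries also expire by age.
// SSLContext is a stand-in for whatever object wraps the certificate material.
public class CertCache {
  private static final int MAX_ENTRIES = 1000;                 // cap memory
  private static final long MAX_AGE_MS = 4 * 60 * 60 * 1000L;  // stop-gap against stale certs

  private static class Entry {
    final SSLContext context;
    final long created;
    Entry(SSLContext context) {
      this.context = context;
      this.created = System.currentTimeMillis();
    }
  }

  // accessOrder=true makes LinkedHashMap behave as an LRU
  private final LinkedHashMap<String, Entry> cache = new LinkedHashMap<String, Entry>(16, 0.75f, true) {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Entry> eldest) {
      return size() > MAX_ENTRIES;
    }
  };

  public synchronized SSLContext get(String domain) {
    Entry entry = cache.get(domain);
    if (entry == null) {
      return null; // miss: caller falls back to MySQL
    }
    if (System.currentTimeMillis() - entry.created > MAX_AGE_MS) {
      cache.remove(domain); // expired: force a fresh lookup
      return null;
    }
    return entry.context;
  }

  public synchronized void put(String domain, SSLContext context) {
    cache.put(domain, new Entry(context));
  }
}

Note that expiry is checked lazily on read, so this shape needs no background sweeper.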

Capping the cache ensures finite memory, and time-based expiry is a stop-gap measure to minimize the blast radius of a bad certificate. Over time, the magic parameters will need to be adjusted. Furthermore, since we expect certificates to be stable over time, we will want to migrate them to the local disk. However, this will require more care as an invalid certificate becomes even more problematic.

Thoughts about caching the Adama request

The WebHandler gets a fully aggregated and parsed request; since the request isn’t an internally handled path (like /libadama.js), it is forwarded to FrontendHttpHandler. There is a bunch of logic here to route to the appropriate document, but the current case falls into the category of matching a space domain ($space.adama.games), which means evaluating against the ide/$space document.

Here is where we have a bunch of painful latency. First, we get the space id in order to look up the RxHTML document and test for a path that returns the generated app shell. This logic, when removed, accounts for approximately 10 ms. I need to reflect on whether to remove it or move it into the ide/$space document. Ignoring this for now…

Executing a request against a document requires the web tier to invoke webGet via the Client. webGet then looks up which host has the document, which is an uncached call to the database since finding the document requires a MySQL lookup. Once the host is known, the client forwards the request to the appropriate Adama host (assuming it is within the region).

Caching the host is challenging as we need to invalidate the cache when a failure happens. However, what we can do is cache the entire process within FrontendHttpHandler. This has its own set of problems, but the goal for now is to produce data.

  // NOTE: the `cache` field is not shown in the original snippet; an unbounded
  // ConcurrentHashMap is a plausible stand-in for this deliberately naive perma-cache
  private final ConcurrentHashMap<String, HttpResult> cache = new ConcurrentHashMap<>();

  private void get(SpaceKeyRequest skr, TreeMap<String, String> headers, String parametersJson, Callback<HttpResult> callback) {
    if (skr != null) {
      String cacheKey = skr.space + "/" + skr.key + "/" + skr.uri;
      // hit: serve the previously computed result without touching the Client at all
      HttpResult result = cache.get(cacheKey);
      if (result != null) {
        callback.success(result);
        return;
      }
      // miss: execute the webGet against the document and remember the result forever
      WebGet get = new WebGet(contextOf(headers), skr.uri, headers, new NtDynamic(parametersJson));
      client.webGet(skr.space, skr.key, get, route(skr, new Callback<HttpResult>() {
        @Override
        public void success(HttpResult value) {
          cache.put(cacheKey, value);
          callback.success(value);
        }

        @Override
        public void failure(ErrorCodeException ex) {
          callback.failure(ex);
        }
      }));
    } else {
      callback.success(null);
    }
  }

Introducing this badly written perma-cache and then running the benchmark quickly

ab -k -c 10 -n 1000 https://www-jeffrey-io.adama.games/index.html

yielded a p95 latency of 290 ms. This is surprising given the number of lookups it eliminates, so I fiddled with the concurrency. Changing the concurrency from 10 to 5 dropped the p95 latency to 135 ms. I backed the change out and re-ran the prior version at the new concurrency level, which had a p95 latency of 170 ms. This gives me the sense that the perma-cache saves roughly 30 ms of latency. Given how radically incorrect this cache is, I’ll back it out and look elsewhere.

Since concurrency is the primary driver of latency, the next layer of the onion is pulling assets from S3. Very quickly, I updated the WebHandler to cache assets directly in memory, as this skips the re-download from S3.

  // the `htmlCache` field is shown without its surrounding class in the original; it lived
  // with the handler, which (as noted below) scoped the cache to a single connection
  private final ConcurrentHashMap<String, byte[]> htmlCache = new ConcurrentHashMap<>();

  private void handleNtAsset(FullHttpRequest req, final ChannelHandlerContext ctx, Key key, NtAsset asset, boolean cors) {
    AssetStream response = streamOf(req, ctx, cors);

    // only cache small HTML files; everything else streams straight from S3
    if (asset.size < 128 * 1024 && asset.contentType.equals("text/html")) {
      String cacheKey = key.space + "/" + key.key + "/" + asset.id;
      byte[] cached = htmlCache.get(cacheKey);
      if (cached != null) {
        // hit: serve the bytes from memory without touching S3
        response.headers(asset.size, asset.contentType);
        response.body(cached, 0, cached.length, true);
        return;
      }
      // miss: stream from S3 to the client while buffering a copy for the cache
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      assets.request(key, asset, new AssetStream() {
        @Override
        public void headers(long length, String contentType) {
          response.headers(length, contentType);
        }

        @Override
        public void body(byte[] chunk, int offset, int length, boolean last) {
          response.body(chunk, offset, length, last);
          buffer.write(chunk, offset, length);
          if (last) {
            htmlCache.put(cacheKey, buffer.toByteArray());
          }
        }

        @Override
        public void failure(int code) {
          response.failure(code);
        }
      });
      return;
    }

    // large or non-HTML assets: pass the stream through untouched
    assets.request(key, asset, response);
  }

This resulted in a p95 latency of 90 ms. It’s worth noting that, after I ran this experiment and moved on, I realized the above implementation is wrong: the caching only happened within a single connection. Whoops, but even with this blunder, it points to a way out of this mess.

This kind of caching is a bit like cheating since you really want to minimize the initial latency, but it is a clue that the connection from the web box to S3 was problematic. I backed out the cache hack and re-ran the benchmark. This time, I jumped onto the production host and looked at the CPU. It was pegged at 100%.

I didn’t look at the stack dumps, but given the prior issue with SSL performance, I realized I’m not pooling connections to S3. Every asset request required a unique connection to S3, paying the expensive TLS handshake each time. Yikes!

Adding connection pooling from the web tier to S3 with a new asynchronous connection pool was fruitful. This took the p95 latency to 145 ms at full concurrency. This mostly helps the box at scale in steady state, as it reduces the SSL overhead of pulling assets. It’s a great win, but we can do better!
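As a sketch of what that pool looks like (the generic shape and the names here are illustrative; the real pool manages connections from the web tier to S3), the core idea is just a queue of idle connections that get handed back out instead of re-handshaking:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Illustrative sketch of an asynchronous connection pool: instead of opening a fresh
// TLS connection to S3 per asset request, idle connections are parked and reused, so
// the handshake cost is paid rarely rather than per request.
public class AsyncPool<C> {
  public interface Factory<C> {
    void create(Consumer<C> onReady); // asynchronously build a new connection
  }

  private final ConcurrentLinkedQueue<C> idle = new ConcurrentLinkedQueue<>();
  private final Factory<C> factory;
  private final int maxIdle;

  public AsyncPool(Factory<C> factory, int maxIdle) {
    this.factory = factory;
    this.maxIdle = maxIdle;
  }

  // hand an existing idle connection to the caller, or asynchronously build a new one
  public void acquire(Consumer<C> onReady) {
    C connection = idle.poll();
    if (connection != null) {
      onReady.accept(connection);
    } else {
      factory.create(onReady);
    }
  }

  // return a healthy connection for reuse; reject it if the pool is already full
  public boolean release(C connection) {
    if (idle.size() < maxIdle) {
      idle.offer(connection);
      return true;
    }
    return false; // caller should close the connection instead
  }
}

The crucial rule is that only healthy connections are released back; a failed or stale connection gets closed so the pool never recycles a broken socket.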

We now formalize the caching with policies and the ability to cache to either memory or disk. With the new async cache, the tricky part is dealing with the thundering herd. This required a semi-formal pub/sub model where multiple readers can be updated by a single writer. We force all calls to evaluate against a new CachedAsset which has a memory version and a file version. The current policy is to use memory for small HTML files, disk for medium-sized files, and no caching at all for large files.
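Here is a rough sketch of that idea (the thresholds, names, and wiring are illustrative rather than the actual CachedAsset implementation): the policy picks a destination by size and content type, the first miss becomes the single writer that pulls the asset, and any requests arriving mid-fill attach as readers to be replayed when the body lands:

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the thundering-herd handling: the first request for an asset
// becomes the single writer filling the cache; requests that arrive while the fill is
// in flight attach as readers and are replayed once the body is available.
public class CachedAssetSketch {
  public interface Reader {
    void deliver(byte[] body, String contentType);
  }

  enum Destination { MEMORY, DISK, NONE }

  // policy from the post: memory for small HTML, disk for medium files, skip large ones
  // (the 8 MB cut-off is a guess, not the production value)
  static Destination pick(long size, String contentType) {
    if (size < 128 * 1024 && "text/html".equals(contentType)) return Destination.MEMORY;
    if (size < 8 * 1024 * 1024) return Destination.DISK;
    return Destination.NONE;
  }

  private final List<Reader> pending = new ArrayList<>();
  private byte[] body;          // filled exactly once by the single writer
  private String contentType;

  // readers either get the cached body immediately or wait for the writer to finish
  public synchronized void attach(Reader reader) {
    if (body != null) {
      reader.deliver(body, contentType);
    } else {
      pending.add(reader);
    }
  }

  // called by the one writer streaming the asset out of S3
  public synchronized void finish(ByteArrayOutputStream streamed, String contentType) {
    this.body = streamed.toByteArray();
    this.contentType = contentType;
    for (Reader reader : pending) {
      reader.deliver(body, contentType);
    }
    pending.clear();
  }
}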

This policy-directed memory cache has a p95 latency of 64 ms while the disk cache has a p95 latency of 75 ms. The beauty of caching assets is that they are immutable, so this is a perfect solution. If we stop here, then the solution feels competitive.

Going the full distance

The final thing to optimize is the pathway from the web tier to Adama, which has dragons. First, there is cache invalidation, as Adama documents may change and the cached responses must change with them. Second, there is the privacy of the GET response to deal with, as an Adama space may return different results for different users. Third, the billing system assumes every GET request is handled by Adama, so caching requires the web tier to submit billing records itself. Fourth, we should think about the thundering herd, as we want Adama to be able to sustain large traffic.

With the S3 pathway cleaned up, let’s re-evaluate the potential of caching the call. Re-introducing the FrontendHttpHandler hack, the p95 latency is now 55ms. That is exciting stuff, and we can pump the concurrency up a lot.

-c    -n      p95      CPU    req/sec
10    1000    55 ms    40%    237
20    5000    65 ms    60%    464
50    5000    124 ms   95%    708
100   5000    234 ms   100%   688
100   50000   151 ms   100%   859

It’s worth noting that the web tier is a single a1.medium host which costs $0.0255 an hour. By caching the Adama request, I’m basically back to measuring the SSL performance of the host, assuming no connection churn. Instead of profiling, we can ask whether this is a good deal.

On a per-request basis, it feels like a fantastic deal at $0.0255 an hour; assume a steady state of 200 req/second and that host deals with 720K requests per hour, yielding a per-request cost of roughly $3.5e-08, which might as well be zero. However, it is important to look upstream, as I’m using a network load balancer (NLB) rather than an application load balancer (ALB). The NLB is significantly cheaper than the ALB because it only handles TCP, which allows (and forces) the web tier to terminate SSL.

Both NLB and ALB have a base price of $0.0225 per hour, but they charge differently for load balancer capacity units (LCUs). An LCU for an NLB costs $0.006 while an ALB LCU costs $0.008, which doesn’t sound like much until you consider the units are different. An NLB LCU allows 800 new connections/second while an ALB LCU allows 25 new connections/second; an NLB LCU allows 100K active connections while an ALB LCU allows 3K active connections. There are some other differences as well which can muddy a comparison, but focus for now on connection churn and volume. The core reason for this gap is that an NLB is pure TCP while the ALB terminates SSL. Security costs!

An ALB will consume 32x the LCUs of an NLB for the same connection churn, and at a higher price per unit. Ignoring the benefit of hiding traffic from Amazon, we can then guess the price of the workload the web tier is doing versus what an ALB would cost. Using the concurrency of 100 as the connection rate, an ALB would need 100 ÷ 25 = 4 LCUs, or $0.032 per hour, which is more than the instance I’m using; an NLB would need only 100 ÷ 800 = 0.125 LCUs, or well under a tenth of a cent per hour. This suggests that I’m competitive, and I can skip a deep dive into the SSL tech I’m using.

The key body of work to continue optimizing the CDN is to add features which give cache control to the authors of spaces. For now, I’ll back out the FrontendHttpHandler hack. I’ll revisit this once I have multi-region support and a centralized database, since those will escalate the cost of doing lookups and talking to Adama.