Building a lookup API for 400k–2M requests per second

We built an API for a new business operation. A business partner consumes it. The contract is simple: a request carries a payload of signals; each signal yields a key; each key is looked up in a large dataset; every record found is returned.

The dataset is roughly 2 billion records — about 1.5 terabytes — across multiple sources, each source with its own retrieval pattern. The traffic grows with the business operation, toward about 2 million requests per second at full scale; the system as described here runs at 400,000 per second.

The sections that follow are the build record of that system, with enough detail to rebuild it and adapt it to your own drivers. Each section covers one component of the request path, ordered from the entry point toward the hardware. Each has two parts: the implementation, then the rationale behind it, with every component and concept explained as it appears. The configuration values are collected in a settings appendix at the end.

The architectural drivers

Designing a system starts with establishing its boundaries: the limits the environment imposes on it. They act before any design decision — they restrict which options are available — so no other input has a more direct effect on the design.

Four architectural drivers shaped this system.

  • Latency — a fixed 5-millisecond timeout, end to end, at p95. Every component on the request path spends a share of it, and that share shaped each component’s design.
  • Deadline — fixed by a sales season. It bounded the design space to what we could build in time.
  • Operating cost — flexible. Lower was better, within a ceiling set by the business scale. We traded availability for it where the fixed drivers held.
  • Yield — flexible. The match rate: the share of looked-up keys that find a record. Each record returned is worth revenue, so a higher rate is worth more. It was traded where the fixed drivers required it.

The fixed drivers defined the boundary of the design space; the flexible ones selected a design inside it.

Section 1 — splitting the 5ms

The implementation

The budget was 5 milliseconds, end to end, at p95. The path has three layers:

  • client → API: everything on the wire before our own work begins.
  • the API business logic: parse the payload, dispatch the lookups, build the response.
  • data retrieval: pull the records from the data sources, across the ~2 billion records.

We held the two app-side layers to a reference of about 1ms each, 2ms together. The network took the remainder.

The call runs between two services in two different datacenters: the partner’s service, on-prem, and ours, in AWS.

The rationale

The 5ms was fixed. We started from it and designed the whole system to fit inside that limit. Inside that window the whole flow has to complete: the request travels from the partner to our app, the app runs its logic, it calls the data sources to fetch the records — and then the whole thing travels back the other way to the partner. Each layer’s cost was unknown in advance. The split came from two things: where the time was most at risk, and how much control we had over each layer.

Latency targets are set as percentiles. p95 means 95% of requests finish under the limit; the slowest 5% are allowed to exceed it. The contrast that matters is with p99, where only 1% may exceed. Factors like virtualization, contention, queuing, and GC pauses contribute far more to p99 than to p95, which is why p99 is much harder to hold.

Percentiles explained: p50, p95, and p99.
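As a minimal sketch of what a p95 or p99 reading means, the nearest-rank percentile over a list of latency samples can be computed as below. The function name and the sample values are illustrative assumptions, not measurements from this system.

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Illustrative data only: 100 requests, 95 fast, 3 slower, 2 very slow.
samples = [2.0] * 95 + [4.0] * 3 + [9.0] * 2
p95 = percentile(samples, 95)  # 2.0: inside a 5ms limit
p99 = percentile(samples, 99)  # 9.0: the tail that p95 allows and p99 does not
```

With this data the same set of requests passes a 5ms limit at p95 and fails it at p99, which is the contrast the paragraph above describes.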

The limit is enforced by a circuit breaker on the partner’s side, the client’s protection against a slow dependency: when a service it calls starts timing out, the breaker opens and the client cuts the request rate until that service recovers. Here, it opens once more than 5% of the responses we return run over 5ms, our p95 limit. The request rate to us then drops, and recovers toward full as our latency returns under the limit. A brief open is acceptable. Frequent or long opens have cost implications and are to be avoided.
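The opening rule can be sketched as a sliding window over recent responses. This is a hypothetical model of the behavior described, not the partner’s implementation: the class name, the window size, and the counting method are all assumptions.

```python
from collections import deque

class LatencyBreaker:
    """Opens when more than `max_slow_share` of the last `window` responses exceed `limit_ms`."""

    def __init__(self, limit_ms=5.0, max_slow_share=0.05, window=1000):
        self.limit_ms = limit_ms
        self.max_slow_share = max_slow_share
        self.samples = deque(maxlen=window)  # True where a response ran over the limit

    def record(self, latency_ms):
        self.samples.append(latency_ms > self.limit_ms)

    @property
    def is_open(self):
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.max_slow_share

# Illustrative run: exactly 5% slow keeps the breaker closed, one more slow response opens it.
breaker = LatencyBreaker(window=100)
for _ in range(95):
    breaker.record(2.0)
for _ in range(5):
    breaker.record(6.0)
closed_at_five_percent = not breaker.is_open
breaker.record(6.0)
open_above_five_percent = breaker.is_open
```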

The ideal setup would be to colocate both services in the same datacenter, simplifying the network infrastructure the communication runs over. That wasn’t achievable under the deadline — standing up edge locations inside the partner’s on-prem datacenter was too complex in the time we had — so we stayed in AWS and treated the cross-datacenter network as a fixed cost.

That settled the split. The two stages we run inside AWS we could reason about: about 1ms each, 2ms together. That is the floor for an app-level budget, because timeouts and cancellations are configured in integer milliseconds, in the runtime and in the client libraries, so 1ms is the smallest budget that can be enforced. The client→API leg holds the most unknowns, the most complexity, and the least control on our side, so it took the remainder, about 3ms. Whether the network fit inside it was the open question.

Section 2 — proving the network

The implementation

Before we built anything, we wanted to know whether the network would hold.

The 3ms leg was the one we understood worst, so we settled it first. We started with an empty endpoint — the smallest thing that could still answer the question: receive a request, deserialize it, return a mocked response. No lookups, no data sources. Whatever latency we measured was the network and nothing else.

That minimal version was the gate. Ideally it would sustain the request rate inside 3ms; at minimum it had to come in under the full 5ms. Pass that, and the project was worth continuing.

The partner had global presence; four metro areas were ours to serve, and the API had to sit close to each one. That gave us four deployments and four separate network paths to validate. A different location means a different distance, different peering, different routing — a different network, each one validated on its own:

  • 2 locations — served by an AWS region.
  • 2 locations — served by a local zone, where no region sat close enough.

The local zones were new to us. None of us had run anything on them before.

The rationale

The two services sit in the same metro area, and with no Direct Connect between them, the traffic crosses the public internet. A single round trip passes through intermediaries such as:

  • the client’s NIC and OS network stack
  • the on-prem datacenter’s LAN switches
  • the on-prem edge router and firewall
  • the datacenter’s uplink to its ISP
  • a local internet exchange or peering point, handing traffic off toward AWS
  • AWS’s edge into the region
  • the load balancer

That’s a representative chain, not an exact one — a real path crosses more hops, can cross several networks, and isn’t always symmetric. Every hop adds time, and most of them sit outside our reach. We tune our own stack and our load balancer. The rest belongs to the partner, the ISPs, and the public routing in between.

(AWS Direct Connect — a dedicated private link into AWS — would have traded that variable public routing for a stable path. Standing one up didn’t fit the deadline, so we accepted the public internet and designed around it.)

The one part we could shape was the fiber the signal travels, and the lever was where we deployed. Light moves through glass at about two-thirds of its speed in vacuum: roughly 5 microseconds per kilometer one way, 10 round trip. That’s a floor: hardware tuning doesn’t move it, and every kilometer of fiber spends budget we can calculate in advance. Fiber doesn’t run in a straight line either; a route is often one and a half to two times the map distance, so the floor is real even in-metro. Deploying in the same metro as the partner is what keeps it short.
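The propagation floor is simple enough to compute in advance. The sketch below uses the figures from the paragraph above; the function name and the example distances are assumptions chosen for illustration.

```python
FIBER_US_PER_KM_ONE_WAY = 5.0  # light in glass, about two-thirds of its speed in vacuum

def fiber_rtt_floor_ms(map_distance_km, route_factor=1.5):
    """Round-trip propagation floor in milliseconds for a given map distance.

    route_factor scales map distance to fiber length (1.5 to 2.0 per the text).
    """
    fiber_km = map_distance_km * route_factor
    return 2 * fiber_km * FIBER_US_PER_KM_ONE_WAY / 1000.0

# Hypothetical distances, not the real deployment geometry.
in_metro = fiber_rtt_floor_ms(50, route_factor=2.0)        # 1.0 ms
cross_country = fiber_rtt_floor_ms(1000, route_factor=1.5)  # 15.0 ms
```

Under these assumptions a 50 km in-metro path already spends a millisecond of the budget on propagation alone, and a 1,000 km path exceeds the whole 5ms before any hop adds its own delay.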

Everything else is congestion, and congestion moves. Sharing the public internet means sharing its busy hours, and we expected the path to see contention that distance alone doesn’t explain. The floor we calculate from geography; the contention we have to measure — a budget that holds at a quiet hour and breaks at a busy one isn’t one we can trust.

Section 3 — the load balancer

The implementation

Inside AWS, the first thing a request reaches is the load balancer.

The four deployments did not get the same load balancer:

  • AWS regions — Network Load Balancer (NLB).
  • AWS local zones — Application Load Balancer (ALB). NLB is not available in local zones.

Two settings applied everywhere:

  • Single Availability Zone (AZ) — we made sure the load balancer is deployed in one AZ only: the one the app runs in.
  • TLS 1.3 — the default listener policy outside the console negotiates at most TLS 1.2; we upgraded it. Termination stays at the load balancer.

The rationale

The app doesn’t run as one server; it runs as many replicas. The load balancer is the one address in front of them, distributing incoming requests across the targets. Its time falls in the client→API leg; the app’s budget begins once traffic reaches the handler.

NLB works at layer 4: it forwards TCP and inspects nothing above it. ALB works at layer 7: it terminates HTTP, parses the request, applies routing rules. More work per request means more time per request, so we picked NLB where we could. In the local zones, ALB was the only option.

The layer also sets how load spreads. NLB distributes per connection: each TCP connection is routed to one target for its life, so with long-lived connections a few heavy clients stay on the same pods and load can sit uneven across them. ALB distributes per request: it terminates the connection and the routing decision is per request, so load spreads across pods regardless of how the connections are distributed.
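A small simulation shows the difference between the two distribution modes. It is a sketch under simplifying assumptions: connections are assigned round-robin at open (a real NLB selects a target with a flow hash), and the request counts per connection are invented.

```python
from itertools import cycle

def per_connection(conn_request_counts, n_targets):
    """Layer-4 style: each connection is pinned to one target for its life."""
    load = [0] * n_targets
    for i, count in enumerate(conn_request_counts):
        load[i % n_targets] += count  # round-robin at connection open (assumed policy)
    return load

def per_request(conn_request_counts, n_targets):
    """Layer-7 style: every request gets its own routing decision."""
    load = [0] * n_targets
    targets = cycle(range(n_targets))
    for count in conn_request_counts:
        for _ in range(count):
            load[next(targets)] += 1
    return load

# One heavy client and three light ones, two targets.
connections = [70, 10, 10, 10]
pinned = per_connection(connections, 2)  # [80, 20]
spread = per_request(connections, 2)     # [50, 50]
```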

Less work also lowers operating cost. Both charge the same base rate; the variable cost follows the work the load balancer does, and forwarding packets is the smallest unit of work. At our request rate, the NLB deployments cost roughly 30% less.

An Availability Zone is one or more datacenters operated as a unit; an AWS region is a group of zones, close together but physically separate. The load balancer isn’t one machine: it runs in every zone you enable, one address per zone, and clients pick an address through DNS, which we don’t control. Our app runs in one zone, so the load balancer got that zone only: an address elsewhere would mean traffic crossing between zones to reach the app, extra time plus a per-gigabyte charge in each direction. One zone avoids that.

This is also the third reason for NLB in the regions: it accepts a single zone. A region ALB requires at least two, a rule AWS enforces for the load balancer’s own availability: when one zone fails, the nodes in the other continue to route. For us it means an address in a datacenter without the app, sending its share of traffic across. In the local zones the rule doesn’t apply, and one zone was enough for the ALB too.

The tradeoff is redundancy: a zone failure takes the deployment down. Multi-AZ removes that risk, at an operating cost the business scale at the time didn’t justify. We accepted single-AZ, a decision to revisit as the business scales. The local zones vary by metro — some have one zone, some two. Where there was one, the question doesn’t arise; where there were two, we made the same decision as in the regions.

TLS 1.3 is about round trips. A TLS 1.2 handshake takes two before any data moves; TLS 1.3 takes one — and we knew what a round trip costs on our path. Connections are persistent, but they still open, and each open pays the handshake. Terminating at the load balancer keeps the cryptographic work out of the app: the handshake ends at the first hop inside AWS, and the app receives plain traffic over the internal network — none of the app’s budget is spent on it.

TLS handshake and connection reuse.
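The handshake cost can be put in numbers once the round-trip time is known. The sketch below counts one round trip for the TCP handshake plus the TLS round trips given above; the 2ms round trip and the requests-per-connection figure are assumed values for illustration.

```python
HANDSHAKE_ROUND_TRIPS = {"TLS1.2": 2, "TLS1.3": 1}  # full handshake, as stated in the text

def connection_open_cost_ms(rtt_ms, tls_version):
    """Time before the first request can be sent: TCP handshake plus TLS handshake."""
    tcp_round_trips = 1
    return (tcp_round_trips + HANDSHAKE_ROUND_TRIPS[tls_version]) * rtt_ms

def amortized_ms_per_request(rtt_ms, tls_version, requests_per_connection):
    """Share of the open cost carried by each request on a reused connection."""
    return connection_open_cost_ms(rtt_ms, tls_version) / requests_per_connection

# Assumed 2ms round trip.
open_tls12 = connection_open_cost_ms(2.0, "TLS1.2")  # 6.0 ms
open_tls13 = connection_open_cost_ms(2.0, "TLS1.3")  # 4.0 ms
```

Either open cost is close to or over the whole 5ms budget, which is why persistent connections matter: spread over thousands of requests, the per-request share becomes negligible.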

DNS stayed out of the budget by design. DNS maps a name to the IP addresses behind it. Before a client can connect, it resolves the load balancer’s name to an address, and an uncached resolution is a chain of round trips that alone exceeds the 5ms budget. The resolved address is cached on the client host — in the application runtime or the OS resolver, checked before any query leaves the machine — and reused until the record’s TTL expires. A resolution is paid at connection open, at most once per TTL window, and requests on an established connection reuse the address — the cost is amortized over every request inside the window. The per-request cost is negligible.
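The caching behavior described above can be sketched as a TTL cache in front of a resolver. This is an illustration, not the resolver the partner runs: the class, the injected `resolve` callable, and the example host name and address are all hypothetical.

```python
import time

class TtlDnsCache:
    """Caches resolved addresses until the record's TTL expires."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve   # callable: name -> (addresses, ttl_seconds)
        self._clock = clock
        self._entries = {}        # name -> (addresses, expires_at)
        self.lookups = 0          # resolutions that actually left the cache

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]       # cache hit: no query leaves the machine
        addresses, ttl = self._resolve(name)
        self.lookups += 1
        self._entries[name] = (addresses, now + ttl)
        return addresses

# Fake clock and resolver so the example is deterministic.
fake_now = [0.0]
cache = TtlDnsCache(lambda name: (["192.0.2.10"], 60), clock=lambda: fake_now[0])
cache.get("lb.example.test")
cache.get("lb.example.test")   # served from the cache
fake_now[0] = 61.0
cache.get("lb.example.test")   # TTL expired, resolved again
```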

Section 4 — the API

The implementation

The last component was the API itself: a .NET 10 minimal API on Linux, with two endpoints — one for the client, deserializing the request and returning a mocked response, and one for the health check.

One runtime setting:

  • Inline scheduling — enabled.

The rationale

A minimal API is the smallest hosting model ASP.NET Core offers: no MVC pipeline, no controller activation, no filter chain — a route mapped directly to a handler. Every middleware component a request passes through spends budget, so the pipeline carries only what the contract requires: deserialize, respond. The health check endpoint exists for the load balancer; it does no work.

Kestrel is the web server built into ASP.NET Core. It runs in the same process as the app: it reads the bytes off the socket, parses them into an HTTP request, and calls the handler. The inline scheduling setting decides which thread runs that work.

On Linux, .NET handles socket I/O on a set of engine threads, owned by the sockets engine and separate from the thread pool. .NET names them “.NET Sockets”; this article calls them the engine threads. Each runs an epoll loop (epoll is the Linux facility for watching many file descriptors at once and reporting which ones are ready for I/O), so one thread can wait on many sockets at a time. The default count is the processor count divided by a per-architecture value (30 on x64, 8 on Arm64), with a minimum of one; at our node sizes that is one thread, which dispatches completions to the thread pool.

Inline scheduling changes where the request runs. By default Kestrel dispatches the continuation off the engine thread, which keeps a slow handler from blocking it. Ours is short and non-blocking, so we run it inline on the engine thread that read the socket, saving the dispatch on every request. The cost is that a blocking call there would stall the engine thread and everything behind it, so the handler has to stay non-blocking. One engine thread would then process requests one at a time, so with inline scheduling on, .NET raises the default count to the processor count.
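The default engine thread count described in this section can be written out as a small function. It is a model of the rule as stated here, in Python rather than the runtime’s own code; the rounding step and the 8-vCPU figure for a 2xlarge node are assumptions of the sketch.

```python
CORES_PER_ENGINE = {"x64": 30, "arm64": 8}  # per-architecture divisor from the text

def default_engine_threads(processor_count, arch, inline=False):
    """Default engine thread count; the rounding rule is an assumption of this sketch."""
    if inline:
        return processor_count  # inline scheduling: one engine thread per processor
    return max(1, round(processor_count / CORES_PER_ENGINE[arch]))

# Assuming 8 vCPUs on a 2xlarge node.
default_x64 = default_engine_threads(8, "x64")              # 1
default_arm = default_engine_threads(8, "arm64")            # 1
inline_x64 = default_engine_threads(8, "x64", inline=True)  # 8
```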

Section 5 — the cluster

The implementation

The app runs in an EKS cluster. On the compute side:

  • Node size fixed to 2xlarge EC2 variants — multiple families (compute, general, memory) and architectures (ARM, AMD), all flex instance types filtered out. Karpenter provisions the cheapest available across the set.
  • One pod per node — no two pods share a node.
  • CPU limits removed from the pod spec.
  • Autoscaling at 50% CPU — when average CPU crosses it, new pods are added.
  • File descriptor limit (nofile) — ~1 million per container, set by the node image at the container runtime level.

The rationale

The app doesn’t run on machines we manage by hand. It runs on Kubernetes; EKS is AWS’s managed version of it. Kubernetes runs applications as pods: a pod is one running copy of the app. Pods are placed on nodes, and a node is an EC2 machine. The placement is automatic: Kubernetes decides which pod lands on which node, and it accepts rules that constrain the decision, such as which nodes a pod may land on and which pods may share one. Karpenter is the component that creates the nodes themselves: when pods need a machine, it provisions one, and the constraints we set tell it which machines it’s allowed to create.

EC2 sells the same machine under several purchase models. On-demand: the list price, no commitment. Spot: spare capacity at 70–90% below the on-demand price, which AWS reclaims on a two-minute notice when it needs the capacity back. We prefer spot; when spot capacity is unavailable, on-demand is purchased.

Autoscaling is how the deployment follows the traffic. The metrics server collects CPU usage from every pod; the autoscaler compares the average against the target and adjusts the pod count. We set the target at 50%, and the number deserves explanation, because it interacts with where the app runs. An EC2 instance is a virtual machine on shared hardware: the hypervisor and neighboring tenants introduce latency variance that a dedicated on-prem machine doesn’t have. The variance is invisible at low utilization and compounds at high utilization — in our measurements, tail latency grew sharply as CPU approached 90–100%. The 50% target keeps the pods far from that zone and leaves headroom for traffic to grow while new pods start. It’s a starting point, not a law; the right target is the one your own tail latency measurements support.
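The comparison the autoscaler makes can be sketched with the rule the Kubernetes Horizontal Pod Autoscaler documents, assuming that is the autoscaler in use: desired replicas are the current count scaled by the ratio of observed to target utilization, rounded up. The tolerance band and stabilization windows of the real controller are left out, and the pod counts are invented.

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=50.0):
    """HPA scaling rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

# Hypothetical readings against the 50% target.
scale_out = desired_replicas(10, 75.0)  # 15 pods
scale_in = desired_replicas(10, 40.0)   # 8 pods
```

A lower target buys headroom at the price of pod count: at the same load, a 50% target runs roughly twice the pods that a 100% target would.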

The node size is about consistency. A 2xlarge gives the pod a fixed allocation of vCPUs and memory. The size is set by memory: the smallest 2xlarge variant in the set carries 16 GB, enough for the in-memory data the app preloads and refreshes (Section 7). The allowed set spans families — compute, general, memory — and architectures — ARM, AMD — and Karpenter provisions the cheapest available across it, serving the operating cost. The variants differ in absolute speed, so the set is sized to its slowest member: every allowed variant meets the latency budget. Flex instance types are excluded because their CPU performance is variable by design — they target average utilization, while a latency budget has to hold on the slowest requests. A p95 target needs steady CPU behavior within each node, so the constraint allows Karpenter only dedicated, non-burstable compute.

The width of the set is for the local zones. A local zone offers a subset of the instance types a full region does, so fewer spot pools stand behind it. A wide set — families, architectures, generations — lets Karpenter draw from more of them, and in a local zone it keeps compute available when a single pool runs out.

One pod per node removes neighbor contention. Two pods on the same node share the CPU caches, the memory bandwidth, and the NIC. None of that sharing shows up in CPU metrics, and all of it shows up in tail latency. With a single pod per node, the entire machine belongs to one workload: its cache lines, its network queues, its memory channels.

Removing CPU limits removes throttling. A CPU limit in Kubernetes is enforced by CFS bandwidth control, the kernel mechanism that caps how much CPU time a group of processes may use per scheduling period. The container gets a slice of CPU time each period, and when the slice is spent, every thread in the container stops until the next period. The stall is measured in milliseconds — larger than our entire app budget. CPU requests stay in place, so the scheduler still sizes placement correctly; with one pod per node there is no neighbor to protect, and the limit would only protect the pod from itself. The pod can use the whole node, which is exactly what it was given the node for.
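The size of a throttling stall follows from the quota arithmetic. The sketch below assumes the default 100ms CFS period and threads that run flat out; the limit and thread counts are illustrative, not our pod spec.

```python
CFS_PERIOD_MS = 100.0  # default CFS bandwidth period on Linux

def throttle_stall_ms(cpu_limit_cores, busy_threads, period_ms=CFS_PERIOD_MS):
    """Worst-case stall per period when `busy_threads` run flat out under a CPU limit.

    The quota (limit * period) is shared by all threads in the container; once it is
    spent, every thread waits for the next period.
    """
    quota_ms = cpu_limit_cores * period_ms
    time_to_spend_quota = quota_ms / busy_threads
    return max(0.0, period_ms - time_to_spend_quota)

stalled = throttle_stall_ms(cpu_limit_cores=4, busy_threads=8)    # 50.0 ms
unstalled = throttle_stall_ms(cpu_limit_cores=8, busy_threads=8)  # 0.0 ms
```

In the first case eight busy threads spend a 4-core quota in 50ms of wall time and then stop for the remaining 50ms, an order of magnitude more than the 5ms end-to-end budget.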

Every open socket is a file descriptor, and at 400,000 requests per second a pod holds a lot of them. A default Linux host caps open descriptors at 1024; the node image raises it to roughly a million. The point is scale: at 1024 a pod hits its connection ceiling long before its CPU ceiling, so the deployment would need far more pods than the work requires. The high limit lets CPU govern pod count — the 50% target set above — so one pod carries its full share of the request rate. It spends none of the latency budget.

Section 6 — the rollout

The implementation

The test passed across all four deployments. Each sustained the request rate, and end-to-end p95 landed under 3ms in every region — the target we designed for. The exact figure varied from region to region.

The rationale

End-to-end latency — the number the 5ms is measured against — could be read only on the partner’s side, where the request starts and the response lands. Access to it ran through our counterpart: we requested the measurements by email, one step at a time. Ramp a step, hold it, ask for their numbers, read them, move on.

The endpoint under test was the mocked one: deserialize the request, return a mocked response, no business logic and no data sources behind it. The path it exercised was the network and the app — the 3ms client→API leg plus the handler. The data layer comes later in the system and later in this article; this rollout proved the front of the path.

We rolled out one region at a time. We gave the partner a direct endpoint for each regional API, and they ramped the request rate toward us in steps, holding each for one to two days:

  • 1k rps — latency.
  • 10k rps — scale.
  • 100k rps — scale.

The order follows from what each requirement needs. Latency shows at low load, and the rate doesn’t move it: 1k rps was enough to read the floor. Scale needs volume — 10k and 100k drive the autoscaler, exercise Karpenter’s provisioning, and confirm the 50% CPU target holds the tail as pods are added under partner traffic. The hold durations followed the partner’s rollout process.

For observability we added a request counter to the mocked API and, at this stage, read that rather than the load balancer metrics. ALB terminates HTTP and reports request counts directly; NLB works at layer 4 and does not, so a counter in the app read the same in every region. Synced against the partner’s send rate over email, it confirmed that the requests reached us, not that the responses reached the client. For this stage that was enough.

The p95 varied by region for two reasons. The network: each region serves a different metro over a different path (distance, peering, routing), so the client→API leg differs region to region. And the AWS setup, which the earlier decisions narrowed to two factors. The load balancer: regions run NLB at layer 4, local zones run ALB at layer 7, and the layer-7 work adds time. The hardware: a local zone’s newest available generation typically runs two to three generations behind its parent region’s. The local zones here ran older-generation EC2 than the regions, and a newer core finishes the same work faster.

Section 7 — the API logic

The implementation

The API runs two ways: the endpoint flow that answers each request, and the background tasks it schedules.

The endpoint flow covers the two app-side parts of Section 1’s split (1ms logic, 1ms retrieve), 2ms in total. Within that budget it runs as follows:

  • A request arrives with its payload of signals.
  • Each signal yields a key, taken from the payload or derived from it.
  • The keys go out as a batch lookup — up to ~10 keys per batch, a 1ms retrieve budget. Some resolve against the main data store, some against in-memory data.
  • The records found are transformed to the output form and returned.
  • When a key has no record, what happens depends on the key: some are skipped and nothing further happens; for others, the data has to be created, and a background task is scheduled. Scheduling does not block the response.
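The steps above can be sketched as one function. Every name here is hypothetical, the stores are plain dictionaries, and the sketch assumes the in-memory set is consulted first with the remainder batched to the main store; the real split between the two may be decided per key type.

```python
def handle_request(signals, derive_key, memory_set, main_store_get_batch,
                   needs_fill, schedule_fill, transform):
    """One pass of the endpoint flow: keys, batch lookup, transform, response."""
    keys = [derive_key(s) for s in signals]

    # Resolve what the in-memory set can answer; batch the rest to the main store.
    found = {k: memory_set[k] for k in keys if k in memory_set}
    remaining = [k for k in keys if k not in found]
    found.update(main_store_get_batch(remaining))  # returns only the keys it found

    # Misses: some keys are skipped, others get a background fill scheduled.
    for k in keys:
        if k not in found and needs_fill(k):
            schedule_fill(k)  # must not block the response

    return [transform(found[k]) for k in keys if k in found]

# Illustrative wiring with stand-in stores.
main_store = {"b": 2}
scheduled = []
response = handle_request(
    ["A", "B", "C", "D"], str.lower, {"a": 1},
    lambda ks: {k: main_store[k] for k in ks if k in main_store},
    lambda k: k == "c", scheduled.append, lambda record: record * 10)
```

Here `response` holds the two records found, key `c` is queued for a fill, and key `d` is skipped.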

The background task runs outside the response path. What it does depends on the key type:

  • For some key types, it reads the value from a data source slower than the main store and writes it to the main store. The next request with that key finds the record there and returns it.
  • For others, it hands off to another service of ours, outside this API, that can take minutes to produce the data. A frequency check gates the handoff: the key must be scheduled a set number of times before the task hands off. After the handoff it writes a placeholder record for the key to the main store, so a later request finds it and schedules nothing further while that service runs. The real data is written by that service when it finishes.
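The frequency check and the placeholder write can be sketched as below. The class, the in-process counter, and the dictionary standing in for the main store are assumptions; where the real counts are kept is not specified here.

```python
from collections import Counter

PLACEHOLDER = object()  # stands in for the placeholder record

class HandoffGate:
    """Hands a key off only after it has been scheduled `threshold` times."""

    def __init__(self, threshold, hand_off, main_store):
        self.threshold = threshold
        self.hand_off = hand_off      # callable(key): asks the other service to produce the data
        self.main_store = main_store  # dict-like stand-in for the main store
        self.seen = Counter()

    def on_scheduled(self, key):
        self.seen[key] += 1
        if self.seen[key] < self.threshold:
            return False              # not frequent enough yet
        self.hand_off(key)
        self.main_store[key] = PLACEHOLDER  # later requests find a record and schedule nothing
        del self.seen[key]
        return True

# Illustrative run with a threshold of three.
store, handed = {}, []
gate = HandoffGate(3, handed.append, store)
results = [gate.on_scheduled("k") for _ in range(3)]  # [False, False, True]
```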

The rationale

Two of the drivers governed the data-access design.

The first is the 2ms endpoint budget. It admits only the fastest retrievals: key lookups against the main store and against in-memory data. The main store holds most of the data — the large, dynamic set; in-memory holds the static set, loaded at startup and refreshed on a fixed interval. The in-memory set has a capacity bound: the node can land on any of the 2xlarge variants allowed in Section 5, and the smallest carries 16 GB, so the set is sized to that floor and loads on whichever variant gets provisioned.

The second is the deadline. It was fixed and short, so the work of supplying data missing from the main store stays inside the API. The workload allows it: the keys recur, so a key that misses now returns later, and the cost of supplying it is amortized over its later occurrences. Data the 1ms lookup cannot return is supplied in the background: fetched from a slower source and written to the main store, or, when it has to be generated, handed to the separate service that generates it, which writes it when finished. Generating data carries a cost, so the frequency check gates the handoff: a key is handed off only after it recurs enough times to justify that cost. The next request for that key then returns on the hot path.

Keeping this flow inside the API raises tail latency. The deadline forced it: a separate system to own the fill would add complexity, and building it would have exceeded the deadline.

Section 8 — the thread model

The implementation

The request path runs on the engine threads, each request on the thread that read its socket. Background work is dispatched to the thread pool.

The data store client exposes its operations two ways, synchronous and asynchronous. The request path calls the synchronous operations, on the engine thread. The background tasks call the asynchronous operations.

This arrangement was the lowest in both compute cost and latency. In a region with gen 9 available, two general-purpose gen 9 nodes handled 100k rps. The node count for the same load rises as the generation drops: gen 5, still present in some local zones, takes 10–12 nodes. The business logic compute is negligible.

The API’s request-response latency sits below 1ms at p95 and around 2ms at p99, varying by region. We can read this only on the local-zone deployments, which run ALB and publish a per-request latency metric.

The rationale

The default configuration met the requirement. One engine thread reads the sockets and dispatches each request to the thread pool, which runs it; that held p95 inside the budget. The inline configuration was pursued for the improvement in latency and compute cost.

The thread pool is built for workloads that change over time. It keeps a set of worker threads, runs queued work on whichever is free, and sizes itself to the load: a controller adds and removes threads to hold throughput at the fewest threads, adjusting at most once per completed item or per 500ms.

That adaptation has a cost. When load rises, the current threads carry it while the controller catches up, so work waits in the queue until it does. Each dispatched item also crosses a thread boundary: a context switch and a cold cache. Across a varied workload these costs are amortized over the throughput the pool sustains, and they stay invisible.

The 2ms budget removes that amortization. The budget is close to the scheduling overhead itself, and the pool’s adaptation pays off only under variable load. The request path is steady and uniform, so that overhead provides no benefit here. A fixed set of threads running one kind of work fits the regime.

That fixed set is the engine threads. Each runs an epoll loop over many sockets. On the request path the work on the engine thread is the data store lookup.

The data store client provides that lookup in two forms, synchronous and asynchronous, and the thread behavior of each is specific to its implementation. The two forms differ in whether the lookup passes through the thread pool.

The asynchronous form hands the store send to the thread pool: the store client queues the send to a worker, the worker starts it, and the engine thread stays in its loop serving other sockets. The store’s response completes back on the engine thread, so the pool runs only the send. That hand-off is the cost. Under load the send waits in the pool’s queue before it runs, and the 1ms retrieve budget counts that wait — the lookup can spend its budget in the queue before any network I/O to the store begins.

The synchronous form runs the store call on the engine thread and blocks it until the store responds. A multi-node batch queries the store nodes in series on that thread, so it can block across several round trips, all sharing the one budget. There is no thread pool and no queue, so the full budget goes to the store I/O. The thread is held for the duration, so the default engine thread count is not enough: with some threads held on synchronous lookups, a higher count keeps the rest serving sockets. We set that count from testing.

The background work is the part the budget does not cover, and it stays off the request path. When a lookup misses a key that needs filling, the request path writes the signals for that fill to an in-memory channel and returns. That write does not block: the channel is bounded, and when it is full the write drops the new signal and the request path continues. The keys recur, so a dropped signal returns on a later request and the fill runs then.
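The drop-on-full write can be sketched with a bounded queue from the Python standard library. This illustrates the pattern only; the API itself is .NET, and the class and method names here are hypothetical.

```python
import queue

class FillChannel:
    """Bounded channel whose writer never blocks: when full, the new signal is dropped."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def try_write(self, signal):
        try:
            self._q.put_nowait(signal)
            return True
        except queue.Full:
            self.dropped += 1  # the key recurs, so a later request retries the fill
            return False

    def read(self):
        return self._q.get_nowait()

# Illustrative run: capacity two, third write is dropped.
channel = FillChannel(capacity=2)
writes = [channel.try_write(s) for s in ("a", "b", "c")]  # [True, True, False]
```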

A hosted background service, running on the thread pool, reads the channel, builds each fill task, and runs it under a concurrency limit that caps how many run at once. The limit is adjustable at runtime. The tasks are the fills described in Section 7: a read from a slower source written back to the main store, the frequency check gating a handoff, and the handoff to the generating service followed by a placeholder write.

The cap exists because of where those tasks go. The services they call run in other regions, so each carries the distance latency of Section 2, and they are not built for this request rate. The concurrency limit and per-service QPS guards hold the background load within what those services accept.
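A concurrency cap of this kind is commonly built on a semaphore. The sketch below is a simplified illustration: the limit is fixed rather than adjustable at runtime, the per-service QPS guards are omitted, and `fake_fill` stands in for a call to a remote service.

```python
import asyncio

async def run_fills(tasks, limit):
    """Runs fill coroutines with at most `limit` in flight; returns the peak concurrency seen."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(task):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await task()
            finally:
                in_flight -= 1

    await asyncio.gather(*(guarded(t) for t in tasks))
    return peak

async def fake_fill():
    await asyncio.sleep(0.01)  # stands in for a call to a slower, remote service

# Ten fills, at most three running at once.
peak_concurrency = asyncio.run(run_fills([fake_fill] * 10, limit=3))
```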

Section 9 — the main data source

The implementation

The main data source is an Aerospike cluster.

In each region we deployed one edge cluster, on the same architecture we already ran elsewhere in the infrastructure. It is the destination of active-passive XDR replication: a source cluster ships to it, and the edge cluster serves reads.

The local zones are different. There the local edge cluster has no XDR replication, and the data is served a different way — covered in the next section.

The rationale

Aerospike was the first choice. Most of the data already lived there, we had run it for years, and we had the most proficiency with it. In the right setup it returns reads under 1ms at p99, and reads are what this API does.

We call these edge clusters: each sits close to the metro it serves and answers the API’s reads. Each cluster is two nodes, hybrid memory, two NVMe drives per node, matching the clusters in our existing infrastructure. The regions this API serves are new to us, and matching that architecture kept the data layout and operational behavior consistent.

Two nodes is the minimum that survives a node loss. Aerospike distributes data across the nodes by partition; redundancy comes from the replication factor, which places a copy of each partition on another node. At replication factor two with two nodes, each node is master for half the data and replica for the other half, so each holds a full copy. If one node fails, the other promotes its replicas to master and serves the whole dataset at reduced capacity until the failed node returns and the cluster rebalances.
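The ownership arithmetic can be checked with a toy partition map. The modulo assignment is a simplification assumed for the sketch (Aerospike computes its own map), and the node names are placeholders; the count of 4096 partitions is Aerospike’s fixed partition count.

```python
N_PARTITIONS = 4096  # Aerospike's fixed partition count

def partition_map(nodes):
    """Replication factor 2: each partition gets a master and a replica on different nodes."""
    return {
        pid: (nodes[pid % len(nodes)], nodes[(pid + 1) % len(nodes)])
        for pid in range(N_PARTITIONS)
    }

def fail_node(pmap, failed):
    """Promote the replica wherever the failed node was master; drop it wherever it was replica."""
    survivors = {}
    for pid, (master, replica) in pmap.items():
        if master == failed:
            survivors[pid] = (replica, None)
        elif replica == failed:
            survivors[pid] = (master, None)
        else:
            survivors[pid] = (master, replica)
    return survivors

pmap = partition_map(["A", "B"])
masters_on_a = sum(1 for master, _ in pmap.values() if master == "A")  # 2048, half the data
after_loss = fail_node(pmap, "A")  # node B is now master for all 4096 partitions
```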

We run Aerospike in its hybrid memory configuration: the primary index in DRAM, the record data on the NVMe drives. A read resolves the key against the in-memory index, which holds the exact device and offset of the record, then issues one read to that location.
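The read path can be modeled in a few lines: an index held in memory that maps each key to a device location, and one device read per hit. This is a toy model, not Aerospike’s storage engine; byte arrays stand in for the raw NVMe devices, and Python’s built-in `hash` stands in for the way records are spread across devices.

```python
class HybridStore:
    """Index in memory, record bodies on 'devices' (bytearrays stand in for raw NVMe)."""

    def __init__(self, n_devices=2, device_size=1 << 16):
        self.devices = [bytearray(device_size) for _ in range(n_devices)]
        self.write_offsets = [0] * n_devices
        self.index = {}        # key -> (device, offset, length), held in DRAM
        self.device_reads = 0

    def put(self, key, body: bytes):
        dev = hash(key) % len(self.devices)  # stand-in for the device-level distribution
        off = self.write_offsets[dev]
        self.devices[dev][off:off + len(body)] = body
        self.write_offsets[dev] = off + len(body)
        self.index[key] = (dev, off, len(body))

    def get(self, key):
        loc = self.index.get(key)            # memory lookup, no device I/O
        if loc is None:
            return None                      # a miss costs no device read
        dev, off, length = loc
        self.device_reads += 1               # exactly one device read per hit
        return bytes(self.devices[dev][off:off + length])

store = HybridStore()
store.put("k1", b"hello")
body = store.get("k1")       # one device read
missing = store.get("nope")  # answered from the index alone
```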

Keeping the data in DRAM as well would hold the whole ~2 billion record set in memory, which is larger than the DRAM on the nodes. Keeping the index on flash with the data would add a device read to every lookup, raising read latency. The hybrid split keeps the lookup in memory and reads only the record body from the device, which fits the data size and the retrieve budget together.

Aerospike reads the data drives as raw block devices with Linux direct I/O, bypassing the OS page cache. Every record read is a direct device read at a fixed cost, independent of which records were read before. Under a 1ms retrieve budget that predictability matters: the same read costs the same time on every request.

The data drives are instance-local NVMe, physically attached to the node, which is what holds each read at the drive’s microsecond latency and keeps the per-read cost fixed. Network-attached block storage such as EBS routes every read over the network, adding latency and variance to each one. The cost we accept for local drives is coupling: they are ephemeral and bound to the instance, so storage and compute scale together and the data lives only as long as the node. The in-cluster replication and the source cluster behind each region edge cluster make a lost node’s data recoverable, so the coupling is acceptable.

Each node carries two NVMe drives, and Aerospike distributes records across both by a device-level hash. This is separate from the partition map: the partition map distributes the 4096 logical partitions across nodes, while the device distribution operates inside a node, across its drives. Each drive serves reads independently, so two drives give roughly double the read capacity of one.
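The two levels of placement can be sketched side by side. This is an illustration under assumptions, not Aerospike's implementation: SHA-1 stands in for the RIPEMD-160 digest Aerospike computes (it is used here only because it is always available in `hashlib`), the byte slices chosen for each level are arbitrary, and `device_index` is a hypothetical name for the device-level hash the text describes.

```python
import hashlib

N_PARTITIONS = 4096


def digest(key: str) -> bytes:
    # Stand-in for the record digest; the real system uses RIPEMD-160.
    return hashlib.sha1(key.encode()).digest()


def partition_id(d: bytes) -> int:
    # Cluster level: 12 bits of the digest pick one of 4096 partitions,
    # and the partition map decides which node owns that partition.
    return int.from_bytes(d[:2], "little") & (N_PARTITIONS - 1)


def device_index(d: bytes, n_devices: int) -> int:
    # Node level (hypothetical): a different slice of the digest picks the drive,
    # so drive choice is independent of partition choice.
    return int.from_bytes(d[8:12], "little") % n_devices


keys = [f"signal-{i}" for i in range(20_000)]  # made-up keys
per_drive = [0, 0]
for k in keys:
    per_drive[device_index(digest(k), n_devices=2)] += 1

print(per_drive)  # close to an even split across the two drives
```

Because the hash spreads records evenly, each drive receives about half of the reads, which is why the second drive roughly doubles the node's read capacity.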

For this workload the NVMe drives are the first resource to saturate. The index lookup is served from memory, but the record body is read from the device, so every read that misses the in-memory set is one device read. Device read IOPS scale with the read rate against a finite per-drive ceiling, and under a high rate of small random reads the drives reach their ceiling before CPU or network reach theirs. As the drives approach it, reads queue and read latency rises; past it, Aerospike returns device-overload. Two drives raise the ceiling; adding a node raises it further by spreading the data and the read load across more drives. In the cloud the per-node drive count is fixed by the instance type, so this capacity grows by node count.
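The scaling argument reduces to simple arithmetic. The sketch below assumes reads spread evenly across drives; the device read rate, the per-drive IOPS ceiling, and the target utilization are placeholder numbers, not measurements from this system, and both function names are hypothetical.

```python
import math


def drive_utilization(read_rate, nodes, drives_per_node, iops_ceiling_per_drive):
    """Fraction of the per-drive read ceiling consumed at a given device read rate."""
    per_drive = read_rate / (nodes * drives_per_node)
    return per_drive / iops_ceiling_per_drive


def nodes_needed(read_rate, drives_per_node, iops_ceiling_per_drive, target_utilization):
    """Smallest node count that keeps every drive at or below the target utilization."""
    usable_per_node = drives_per_node * iops_ceiling_per_drive * target_utilization
    return math.ceil(read_rate / usable_per_node)


# Placeholder figures for illustration only.
CEILING = 200_000  # assumed small-random-read IOPS ceiling per drive

print(drive_utilization(400_000, nodes=2, drives_per_node=2, iops_ceiling_per_drive=CEILING))
print(nodes_needed(400_000, drives_per_node=2, iops_ceiling_per_drive=CEILING, target_utilization=0.5))
print(nodes_needed(1_000_000, drives_per_node=2, iops_ceiling_per_drive=CEILING, target_utilization=0.5))
```

Since `drives_per_node` is fixed by the instance type, the node count is the only term left to raise when the read rate grows.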

The network ceiling is also per instance. EC2 sets a bandwidth allowance by instance type and size; on smaller sizes the documented figure is a burst ceiling above a lower baseline. Past the allowance, packets queue or drop. For a database node, the allowance bounds read throughput the same way the drive ceiling does.

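The signals to watch are read latency and device-overload errors. On NVMe the utilization percentage is unreliable for this: it measures the share of time the drive had at least one request outstanding, and an NVMe drive serves many requests in parallel, so it can report a drive fully busy while throughput remains available.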

A region edge cluster receives its data through active-passive XDR replication from a source cluster, the cluster in our existing infrastructure where writes land. XDR is asynchronous, so a region edge cluster trails the source by the replication lag, and the data it serves can be slightly behind. We accepted that tradeoff.

We aimed to run the same architecture in the local zones: the disk-based, two-node edge cluster fed by XDR. The hardware was the constraint. A local zone offers a subset of the instance types a full region does, and the storage-optimized instances this setup needs were missing from the local zones we used, with the available higher-capacity options costing three to five times the regional price.

So the local edge clusters run memory-only, with no XDR. Serving from a memory-only store changed how these clusters are populated and read, so the API runs a different path for the local-zone case. That path is the subject of the next section.

Section 10 — the local-zone data source

The implementation

The local edge cluster runs memory-only, with no XDR.

It is a hot cache. The data that the regions serve from the main data source is reached here through an extension: an edge cluster in the nearest region that has the hardware for the hybrid setup and holds that data. The other data sources from Sections 7 and 8 operate unchanged.

The data path reuses the background-fill pattern from Section 7:

  • A read misses in the local edge cluster.
  • A background fill reads the record from the extension and writes it to the local edge cluster.
  • The key recurs, and a later request finds the record in the local edge cluster.

The cache is short-lived: records carry a TTL and are refreshed or extended on access.
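The miss, fill, and hit sequence with a TTL can be sketched as follows. This is a minimal in-process model under assumptions: the class and method names are hypothetical, a Python dict stands in for the memory-only local edge cluster, `read_from_extension` stands in for the read against the extension, and the choice to extend the TTL on every hit is one reading of "refreshed or extended on access".

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class HotCache:
    """Serve from local memory; on a miss, return nothing and fill in the background."""

    def __init__(self, read_from_extension, ttl_seconds, clock=time.monotonic):
        self._read = read_from_extension
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}      # key -> (record, expires_at)
        self._in_flight = {}  # key -> Future of a pending fill
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def get(self, key):
        now = self._clock()
        with self._lock:
            entry = self._store.get(key)
            if entry is not None and entry[1] > now:
                # Hit: extend the TTL so active records stay resident.
                self._store[key] = (entry[0], now + self._ttl)
                return entry[0]
            # Miss or expired: drop any stale entry and schedule at most one fill.
            self._store.pop(key, None)
            if key not in self._in_flight:
                self._in_flight[key] = self._pool.submit(self._fill, key)
        return None  # the request is never blocked on the slower source

    def _fill(self, key):
        record = self._read(key)
        with self._lock:
            if record is not None:
                self._store[key] = (record, self._clock() + self._ttl)
            self._in_flight.pop(key, None)

    def wait_for_fills(self):
        """Test helper: block until the fills pending at call time have finished."""
        with self._lock:
            pending = list(self._in_flight.values())
        for future in pending:
            future.result()


extension = {"k1": "record-1"}  # stand-in for the extension cluster
fake_now = [0.0]                # controllable clock for the demonstration
cache = HotCache(extension.get, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("k1")    # miss: nothing returned, fill scheduled
cache.wait_for_fills()
second = cache.get("k1")   # the key recurs and is now resident
fake_now[0] = 200.0
third = cache.get("k1")    # TTL elapsed without access: miss again
cache.wait_for_fills()
print(first, second, third)
```

The first request for a key always returns nothing, which is the yield cost this design accepts in exchange for keeping the slower read off the request path.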

The rationale

Operating cost drove the choice.

Section 9’s hybrid memory architecture needs storage-optimized hardware. The local zones did not offer the instance the regions ran. The available storage-optimized sizes were larger and more powerful than this load needs, so they would have run underutilized. A local-NVMe instance also costs more than a memory-only instance of the same compute and memory.

The chosen setup runs memory-only instances for the local edge cluster. In our case the extension is a region edge cluster we already ran, so the only added cost was the local in-memory nodes — less than the larger storage-optimized instances a local-zone cluster would have required.

Which records a metro requests in a day depends on the region, the weekday, the month, the season, and other factors — on the order of 5 to 10 percent of the set. The hot cache holds that active fraction, which fits in memory where the full set would not. The fraction varies with those factors, so the in-memory nodes start from a high baseline and scale vertically as running memory usage rises or falls. The TTL is the eviction mechanism: a record not accessed within it expires, so only the active fraction stays resident.
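The size of that active fraction follows from the figures already given. The sketch below is an estimate under assumptions: it takes the ~2 billion record, ~1.5 TB totals as the size of the set behind the hot cache and assumes records are of uniform size; neither is stated for the main data source alone. The function name is hypothetical.

```python
RECORDS = 2_000_000_000            # from the article: roughly 2 billion records
DATASET_BYTES = 1_500_000_000_000  # from the article: about 1.5 terabytes


def hot_set(percent):
    """Records and bytes in the active fraction, assuming uniform record size."""
    return RECORDS * percent // 100, DATASET_BYTES * percent // 100


low_records, low_bytes = hot_set(5)
high_records, high_bytes = hot_set(10)

print(f"5%:  {low_records:,} records, {low_bytes / 1e9:.0f} GB")
print(f"10%: {high_records:,} records, {high_bytes / 1e9:.0f} GB")
```

The spread between the two ends of the range is the reason the nodes start from a high baseline and are resized as usage moves.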

From the local zone, the extension is a slower data source, mainly because of distance: it sits in another metro, and the cross-region path carries the latency of Section 2. The Section 7 fill logic applies without change: a miss in the local edge cluster schedules a background read from the extension and writes the record back to the local edge cluster. The region’s ready-to-serve set became one more slower source behind the existing mechanism.

The cost is lower yield. The local edge cluster returns a record only after the key has been filled and while it remains resident, so more lookups return nothing than in a region. A lower match rate can mean lost revenue or missed revenue opportunity. The tradeoff was acceptable here: it was not yet clear which was higher, the hardware cost of the full setup or the revenue missed by the lower match rate.

Section 11 — the reconciliation

The end-to-end p95 is measured on the partner’s side, and the breaker from Section 1 enforces it: past the limit on enough responses, the request rate to us is cut. The system runs at the contracted rate continuously. Breaker opens are extremely rare and short, on the order of a minute. Sustained flow at the contracted rate is therefore the evidence that the 5ms budget holds at p95.

The figure we measure directly on our side is the app-side share, published in Section 8 and taken from the local-zone deployments. It varies by region, hardware, day, and hour; the per-deployment breakdown stays out of this article.


Settings appendix

Inline scheduling (Section 4) — two switches, set together:

  • UnsafePreferInlineScheduling — Kestrel socket transport option. Inlines the application and transport continuations on the engine thread.
  • DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1 — runtime environment variable. Inlines the socket completions at the runtime layer and raises the default engine-thread count to the processor count.

Engine thread count (Sections 4, 8):

  • DOTNET_SYSTEM_NET_SOCKETS_THREAD_COUNT=<value> — overrides the engine-thread count. Set from testing.

Thread pool spin limit:

  • DOTNET_ThreadPool_UnfairSemaphoreSpinLimit=0 — disables thread pool worker spin-waiting; an idle worker blocks immediately. Default is 70.