Third time's the charm: building a low-latency realtime database
When I set out to build Lark -- a Firebase Realtime Database alternative -- I had one non-negotiable requirement: predictable, low latency. My goal was to keep internal p99 latency under 50ms and p50 latency under 10ms.
That single requirement led me through three distinct architectures before I found one that worked, and taught me several lessons that might prove useful to others building realtime systems.
Quick overview
Lark is designed to do one thing: sync a JSON structure between multiple clients. This is a seemingly simple foundation on which you can build a lot of really cool stuff, from multiplayer games to collaborative web apps. But the JSON structure is incredibly flexible, and it needs to support tens (or even hundreds) of thousands of connected clients, so once you start really scaling up, you quickly run into some deep technical challenges.
Attempt 1: Go and the goroutine dream
Go seemed like the natural choice. My core design principle was isolation: each database should be its own little world, processing writes independently, so that one hot database can't degrade the experience for all the others. Go's goroutine model maps beautifully to this: spin up a goroutine per database, let the runtime schedule them, lean on channels for communication.
The initial build went fast. Go is fantastic for quickly standing up networked services, and within a few working sessions I had a functioning server with security rules, queries, persistence, and a WebSocket transport. Early load tests at modest scale looked great.
Then I wrote a load tester that simulated 'typical' user behaviour and pushed it to 75,000 concurrent connections, and several issues became apparent.
The GC wall
The first problem was Go's garbage collector. Each database maintains a JSON tree in memory, represented as a map[string]*Node structure. A 50MB JSON document expands to roughly 130MB of Go structs, and every map entry, every string key, every *Node pointer is something the GC has to trace.
At scale (thousands of databases, millions of nodes), GC mark phases were taking multiple seconds. During those pauses, every connected client experienced a latency spike. When you're building for realtime, having 20 seconds of smooth sailing punctuated by 1-2 seconds of hiccups just isn't acceptable. The tail latency is the user experience.
I threw everything at the problem. I designed tiered storage that kept some data in actual *Node structs and the rest as raw JSON strings to reduce heap size. I drafted an elaborate flat node array proposal to minimize pointer counts. I tuned GOGC and GOMEMLIMIT. I tried pooling zstd encoders, hand-rolling JSON serialization, and went multiple rounds eliminating allocations on every hot path I could find. I lived in the Go profiling tools (which are excellent!).
Unfortunately, none of it was enough. The fundamental issue is that Go's GC traces pointers, and a JSON tree is made of pointers. You can reduce the count, but you can't eliminate it. It felt like I was fighting the runtime, and I was losing the battle.
The syscall tax
The GC wasn't my only problem. One thing I want Lark to do that Firebase RTDB can't is support fast, frequent, small updates (think cursor presence, or syncing positions in a social VR game). That means supporting UDP-backed transports in addition to TCP. In the browser, I reached for WebTransport (QUIC), which seemed ideal: unreliable datagrams, no head-of-line blocking.
At 25,000 concurrent connections with 10Hz volatile writes, 44% of CPU time was spent in syscall.Syscall6: pure overhead from individual UDP send/receive calls. QUIC's Go implementation couldn't batch small packets (my 21-byte cursor updates were far below the MTU threshold where GSO kicks in), so every update was its own syscall.
For comparison, I tested a WebSocket + kernel TLS proxy as an alternative and found it used a third of the CPU for the same workload. TCP batches naturally via TSO, and kernel TLS (kTLS) offloads encryption to the kernel. QUIC's user-space requirement was costing me dearly.
In the end, I was able to patch quic-go to add sendmmsg support, which greatly reduced the number of syscalls. But this side quest showed the need to profile your entire stack, including the packages your application code relies on.
Attempt 2: Rust with Tokio
Unfortunately, I came to the conclusion that the GC pauses were simply insurmountable for this use case. No amount of clever data structure work was going to eliminate them; only eliminating the GC itself would. So I decided to take what I'd learned and rewrite the server in Rust. (I also split WebSocket and WebTransport termination off into a proxy layer still written in Go, leaving Rust for just the core realtime database piece, but that's another blog post for the future.)
Rust's ownership model gave me exactly what I needed for the data layer. I designed ArcValue, a copy-on-write JSON tree using Arc<T>:
pub enum ArcValue {
    Null,
    Bool(bool),
    Number(f64),
    String(Arc<str>),
    Object(Arc<BTreeMap<Arc<str>, ArcValue>>),
}
Cloning a tree is O(1) because it's just incrementing a reference count. Mutations use Arc::make_mut() for in-place updates when the refcount is 1, falling back to copy-on-write when the data is shared. No GC pauses, no tracing, deterministic memory management. This data structure survived every rewrite that followed and is still the core of Lark today.
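To make the mutation path concrete, here's a minimal sketch of a write helper. It reuses the ArcValue definition above (plus a #[derive(Clone)], which Arc::make_mut requires); set_child is a hypothetical helper, not Lark's actual API:

use std::sync::Arc;

// Insert a child under an object node. Arc::make_mut mutates the map in
// place when this Arc is the sole owner, and clones it first when another
// snapshot still shares it -- the copy-on-write case.
fn set_child(value: &mut ArcValue, key: &str, child: ArcValue) {
    if let ArcValue::Object(map) = value {
        Arc::make_mut(map).insert(Arc::from(key), child);
    }
}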
For the async runtime, I chose Tokio, which is the standard choice in the Rust ecosystem. I carried over the same design philosophy: isolate databases, use message passing, let the runtime handle scheduling.
Pretty quickly, I had the full feature set ported, and began putting it through its paces with the load tester. Once again, trouble emerged under heavy load.
Same movie, different theater
The GC pauses were gone, which was a huge win. But a familiar pattern started showing up: cross-thread contention.
Tokio uses a work-stealing scheduler. Any task can run on any thread, and the runtime moves tasks between threads to keep all cores busy. Great for throughput, but it means any shared state needs synchronization. And in a realtime database, there's more shared state than you'd expect.
When a client subscribes to a path, the subscription state needs to be accessible from whatever thread processes the next write to that path. When a write arrives, the database tree needs to be readable from whatever thread is generating events for subscribers. I was passing messages through channels, but the channels themselves have synchronization costs. Arc<Mutex<T>> and Arc<RwLock<T>> were showing up everywhere.
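As an illustration of the shape this took (these types are simplified stand-ins, not Lark's actual definitions):

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

type SubscriberId = u64;
// Any worker thread may process the next write to a path, so the
// subscription map has to be shareable and locked.
type Subscriptions = Arc<RwLock<HashMap<String, Vec<SubscriberId>>>>;

async fn notify(subs: &Subscriptions, path: &str) {
    // Even a pure read pays for the lock, and the Arc's refcount bounces
    // between core caches as tasks migrate.
    if let Some(ids) = subs.read().await.get(path) {
        for _id in ids {
            // ... push an event onto each subscriber's channel ...
        }
    }
}

Every access pays that coordination cost, whether or not two threads ever actually collide.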
I tried splitting TCP reads onto separate runtimes. I tried tuning thread counts. I tried a polling approach. Each change either traded one contention point for another, or avoided contention at the cost of throughput or latency.
The core issue: work-stealing and data locality are at odds. Work-stealing wants to move work between threads. Data locality wants work to stay near its data. In a realtime database where the hot path is "receive write -> update tree -> generate events -> send to clients," every step in that pipeline needs access to the same data. Spreading it across threads means synchronization. Keeping it on one thread means you can't use work-stealing.
I'd spent weeks fighting this same problem in Go. A few days into the Tokio version, I could see where it was heading.
Attempt 3: Glommio and thread-per-core
Rust was successfully giving me GC-free, steady performance, but I still needed a different approach to unlock the server's full potential. I had previously read about Glommio, a thread-per-core async runtime built on io_uring, whose architecture is radically different from both Go's goroutines and Tokio's work-stealing.
Glommio uses one event loop per CPU core, pinned via CPU affinity. There should ideally be no shared mutable state between cores. In Lark, each core has its own HashMap<String, DatabaseHandle>. Deterministic routing via xxhash64(database_id) % nr_cores ensures the same database always runs on the same core.
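The routing function itself is tiny. Here's a sketch; Lark uses xxhash64, but std's DefaultHasher stands in below so the example runs without external crates:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a database ID to a core index. Deterministic: the same ID always
// lands on the same core, so its data stays on one thread.
fn core_for(database_id: &str, nr_cores: usize) -> usize {
    let mut h = DefaultHasher::new();
    database_id.hash(&mut h);
    (h.finish() as usize) % nr_cores
}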
There is literally no cross-core synchronization in the hot path. A write arrives at the proxy, gets routed to the correct core based on the database ID, and from that point on, everything -- tree mutation, rules evaluation, event generation, client sends -- happens on a single thread with zero contention.
The difference in this approach
Each core's database map is a plain HashMap, not a DashMap or Arc<RwLock<HashMap>>. Subscription state, client state, and the data tree are all owned by a single thread. That removes the need for mutexes and atomic operations, which eliminates a whole class of coordination work for the CPU.
Tasks don't migrate between threads, so the CPU cache stays warm. There's no scheduler making decisions about where to run the code. It runs on the core it was assigned to.
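The setup looks roughly like this -- a simplified sketch based on Glommio's LocalExecutorBuilder API, where DatabaseHandle is a stand-in for Lark's per-database state:

use std::collections::HashMap;
use glommio::{LocalExecutorBuilder, Placement};

struct DatabaseHandle;

fn main() {
    let nr_cores = 4;
    let executors: Vec<_> = (0..nr_cores)
        .map(|core| {
            // One event loop per core, pinned with CPU affinity.
            LocalExecutorBuilder::new(Placement::Fixed(core))
                .spawn(move || async move {
                    // Owned by this thread alone: a plain HashMap, no locks.
                    let mut databases: HashMap<String, DatabaseHandle> =
                        HashMap::new();
                    // ... accept routed connections and process writes ...
                    databases.insert(format!("db-{}", core), DatabaseHandle);
                })
                .unwrap()
        })
        .collect();
    for ex in executors {
        ex.join().unwrap();
    }
}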
Glommio's use of io_uring also addresses the syscall overhead that was so hard to overcome in the Go version. Instead of one syscall per read/write, you submit batches of I/O operations to the kernel and collect completions asynchronously.
Because there's no contention, no GC, and no work migration, latency is almost entirely determined by the amount of work being done, not by what other databases or other cores are doing. One hot database on core 3 has little impact on the databases running on core 7.
There's no free lunch
Thread-per-core isn't free. You give up the scheduler's ability to automatically balance load. If one core has a hot database doing 10x the work of anything else, that core is busy while others sit idle. You can't borrow capacity from quiet cores.
In theory, the law of large numbers should help keep things balanced. And the consistent routing means I never have cache-cold situations, since the data is always warm on its assigned core. But this is something we'll need to monitor closely as we scale up to serve more 'real' traffic.
The bigger trade-off is ecosystem. Glommio is Linux-only (it requires io_uring), which means developing on macOS requires Docker for compilation and testing. The library ecosystem is smaller than Tokio's. It's committing to a less-traveled path.
For me, the performance characteristics made it an easy choice. But I wouldn't make this recommendation for every project. If your latency budget allows occasional p99 spikes of a few hundred milliseconds, Tokio (or even the original Go-based approach) is a much more practical default.
What I'd tell someone starting a similar project
If you find yourself building something latency-sensitive, there are three lessons I learned that are worth sharing:
Latency and throughput are different problems that want different solutions. Work-stealing maximizes throughput, while thread-per-core minimizes latency. If you care about p99 latency in a realtime system, optimizing for throughput may actively work against you.
GC-based languages have a ceiling for this kind of workload. Go's GC is remarkably good, and for most applications it's invisible. But when your heap is full of pointer-rich data structures and your latency budget is measured in milliseconds, you will hit a wall that no amount of tuning can move.
Profile under realistic load as soon as possible. My Go implementation looked great in small tests. The GC and syscall problems only appeared at scale with realistic data sizes. Likewise, my Tokio implementation compiled and passed tests quickly, with the contention patterns only emerging under significant concurrent load.
Conclusions
Lark is running in production now, handling tens of thousands of concurrent connections with sub-50ms p99 write latency. I got here by being wrong (twice!), but each time I learned something valuable that informed the next attempt. Sometimes it really is about the journey rather than the destination.
If you're looking to power your next social game, collaborative web app, or multiplayer mobile experience, check out Lark and give us a try!