ARCHITECTURE · June 15, 2026

How I put 10,000 players in one world

Most online games hide their scale — they split players into rooms of 20 or shards of a few hundred. For Helix Empire I set a harder target on purpose: 10,000 players in a single shared world, on one server, live in the browser. This is the full story of how that gets built — the four walls you hit, why the real bottleneck is traffic and not CPU, and the moment a load test proved my pretty number was a lie. It's long, it's technical, and every claim ends in a measurement. The lessons transfer to any high-load system.

I build Helix Empire — a real-time, browser-based space strategy game whose core is genetics: you breed castes of creatures by editing their DNA, and those genes ripple out into your economy, your army, and your science. But this post isn't really about the game. It's about the engineering problem underneath it, the one I find genuinely hard and genuinely interesting: putting 10,000 players in a single shared world, on one server, updating live in every browser.

Most games never attempt this. They hide their scale — 20-person rooms, or "shards" of a few hundred that never see each other. I set the harder target on purpose, because that constraint is where the real architecture lives, and because I wanted to prove I could. This is the whole story, told plainly, and it ends — like all honest engineering — in measured numbers, not promises. Even if you never build a game, the shape of this problem shows up in any system that has to push live updates to a lot of people at once.

Say it out loud and four walls appear

"Ten thousand players in one world" is one sentence. The moment you try to build it, four walls appear that a naive design runs straight into.

Wall 1: telling everyone about everyone is O(N²). If every player must know what every other player is doing, then one update cycle is every player times every other player:

10,000 watchers × 10,000 objects = 100,000,000 pairs — per tick

A hundred million pairs, several times a second. At 5 ticks a second that is 500 million pair updates every second. That's not "slow," it's a non-starter, and the cost is quadratic: double the players and you quadruple the work. Any design that broadcasts everyone-to-everyone is dead on arrival.

Wall 2: the dominant cost is traffic, not CPU. This is the unintuitive one, and it's the one that decides everything. On cloud hosts, outbound traffic — egress — is the expensive resource, tens of times pricier than on bare-metal servers. If every tick ships a lot of bytes to every player, the bandwidth bill bankrupts you long before the processor breaks a sweat. So the system has to be designed around minimizing what goes out on the wire, not around raw compute speed. That single realization reorders every other decision.

Wall 3: one world wants to live on one server. Spread a single world across several machines and you've signed up for distributed consensus — the servers constantly arguing about whose copy of the world is the real one. Slow, and brutally complex. So I chose single-writer: each world has exactly one process allowed to change it. No races, no consensus. It has a nasty trap hiding in it, which I'll get to.

Wall 4: a thin client can't keep up. If the server computes everything and the browser just draws pixels, the server becomes the bottleneck for 10,000 people at once. So the browser client has to be thick — smart enough to rebuild most of the picture itself from a tiny bit of data.

Four walls. Each one kills the obvious solution. The architecture is just the set of answers to these four, and the discipline is letting the constraints — not the trendiest tech — pick the tools.

The stack, chosen against the walls

People love to pick technology by fashion. The actual job is the opposite: name your constraints first, then choose the smallest set of tools that answer them. Here's the mapping, because the mapping is the thinking.

The simulation runs in Rust, compiled two ways: a native binary for the server, and WebAssembly for the browser. Same code, both sides. That matters more than it sounds. Because the client runs the identical simulation, it can rebuild the world from compact "seeds" and even predict ahead — which moves work off the server (walls 2 and 4). And Rust has predictable memory with no garbage-collector pauses, so one server holds more players.

Realtime frames travel over WebTransport / QUIC, a binary transport over UDP that avoids TCP's head-of-line blocking, where one lost packet stalls everything queued behind it (wall 2), with a WebSocket fallback. Each world is single-writer with event sourcing: one process mutates it, and its history is a stream of events you can replay to rebuild or audit state (wall 3). And it's hosted on bare metal (Hetzner), where egress is 20–40× cheaper than on the hyperscalers (wall 2 again; notice how often wall 2 shows up).
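To make the single-writer, event-sourced idea concrete, here is a minimal sketch. The event names, the `World` shape, and the `Writer` type are all hypothetical (the post does not show the real ones). The only point is the pattern: one owner mutates state through a pure `apply`, and replaying the log rebuilds the same state.

```rust
use std::collections::BTreeMap;

/// Hypothetical events; the game's real event set is not shown in the post.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    Joined { player: u32 },
    Gained { player: u32, food: u64 },
    Spent { player: u32, food: u64 },
}

/// Hypothetical world state: a food balance per player.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct World {
    pub food: BTreeMap<u32, u64>,
}

impl World {
    /// The only place state changes: a pure transition driven by one event.
    pub fn apply(&mut self, event: &Event) {
        match *event {
            Event::Joined { player } => {
                self.food.entry(player).or_insert(0);
            }
            Event::Gained { player, food } => {
                *self.food.entry(player).or_insert(0) += food;
            }
            Event::Spent { player, food } => {
                let balance = self.food.entry(player).or_insert(0);
                *balance = balance.saturating_sub(food);
            }
        }
    }
}

/// The single writer: it owns both the world and the append-only log.
pub struct Writer {
    world: World,
    log: Vec<Event>,
}

impl Writer {
    pub fn new() -> Self {
        Writer { world: World::default(), log: Vec::new() }
    }

    /// `&mut self` means the borrow checker allows one mutator at a time.
    pub fn commit(&mut self, event: Event) {
        self.world.apply(&event);
        self.log.push(event);
    }

    pub fn world(&self) -> &World {
        &self.world
    }

    pub fn log(&self) -> &[Event] {
        &self.log
    }
}

/// Rebuild state from nothing by replaying the event history.
pub fn replay(log: &[Event]) -> World {
    let mut world = World::default();
    for event in log {
        world.apply(event);
    }
    world
}
```

Committing a few events through a `Writer` and then calling `replay(writer.log())` should give a world equal to `writer.world()`, which is the property that makes audit and rebuild possible.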

Every one of those is in service of one conclusion: the bottleneck is egress and update fan-out. Get that call right and the stack falls out of it. Get it wrong and you'd lovingly optimize the CPU while the bandwidth bill quietly kills you.

Build it in layers, measure every one

I didn't build this in one heroic push. I went layer by layer, and I measured at every step — because the cardinal rule is that you cannot optimize what you have not measured. Here's the part that surprises people: I started by measuring how bad it was.

The baseline. Before touching anything, I wrote down the embarrassing truth: reading game state spiked to 4–10 seconds, and updates pushed to the browser at roughly one frame every 20 seconds. Practically frozen. But now I had a number to beat, which is the only thing that turns "it feels slow" into engineering.

Killing the single-writer trap. Remember the trap I promised? Single-writer means one process changes the world. But while that process applies a tick — and a tick writes to storage, which is slow — every reader is stuck waiting on the same lock. A player opens their defense screen and hangs for ten seconds, not because reading is hard, but because the server happens to be mid-save.

The fix is the most transferable idea in this whole post: separate reading from writing. I built lock-free caches in memory — ready-to-serve projections of the world, updated incrementally, that readers hit without taking the write lock. The writer still solely owns changes (so consistency is never in question), but readers stopped waiting for it.

0.002s: defense read (was 0.002–4.4s with ugly spikes)
~0.006s: events feed (was ~0.2s)
sub-second: every endpoint, stably, once reads left the lock

That read/write split is not a game trick. It's the same move behind read replicas, CQRS, and materialized views everywhere — and it's the one I'd reach for first in almost any system that's slow under load.
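The read/write split can be sketched with the standard library alone. This is an illustration under assumptions, not the engine's cache: `DefenseView` and `Projection` are made-up names, and the sketch uses a short-lived `RwLock` around an `Arc` rather than a truly lock-free structure. Readers hold the lock only long enough to clone a pointer, so they never wait on a tick or a save. A lock-free variant would swap an atomic pointer instead, for example with the arc-swap crate.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical read model for the defense screen.
#[derive(Debug, Clone, PartialEq)]
pub struct DefenseView {
    pub tick: u64,
    pub shields: u32,
}

/// Shared handle to the latest published view.
#[derive(Clone)]
pub struct Projection {
    current: Arc<RwLock<Arc<DefenseView>>>,
}

impl Projection {
    pub fn new(initial: DefenseView) -> Self {
        Projection { current: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Reader path: take the lock just long enough to clone the pointer.
    /// The returned snapshot stays valid even after a newer one is published.
    pub fn read(&self) -> Arc<DefenseView> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Writer path: the next view is built off to the side by the caller
    /// (during the slow tick), then swapped in with one pointer assignment.
    pub fn publish(&self, next: DefenseView) {
        let next = Arc::new(next);
        *self.current.write().unwrap() = next;
    }
}
```

A reader that called `read()` before a `publish` keeps its old, internally consistent snapshot, and the next `read()` sees the new one. The writer remains the only code that decides what the state is.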

Sending the difference, not the world. Even with instant reads, each tick still shipped the entire state, re-transmitting piles of things that hadn't changed. Pure waste on wall 2. So I switched to deltas: send only what changed since the last frame. The delta type is small and mechanically checked: one function builds the difference between two states, another replays it, and a convergence test verifies that a long chain of deltas, folded by the client, reproduces the server's snapshot exactly. Because client and server run the same Rust code for that replay, the two sides cannot disagree about how a delta is applied. If a frame is ever lost, the client notices and asks for a fresh full snapshot.
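Here is a minimal sketch of the diff, apply, and convergence trio described above. The state shape (an id-to-value map) and every name in it are assumptions for illustration. The real delta type is not shown in the post.

```rust
use std::collections::BTreeMap;

/// Hypothetical state shape: entity id -> value.
pub type State = BTreeMap<u32, i64>;

#[derive(Debug, Clone, Default, PartialEq)]
pub struct Delta {
    pub upserts: Vec<(u32, i64)>, // new or changed entries
    pub removals: Vec<u32>,       // ids that disappeared
}

/// Build the difference between two states.
pub fn diff(old: &State, new: &State) -> Delta {
    let mut delta = Delta::default();
    for (&id, &value) in new {
        if old.get(&id) != Some(&value) {
            delta.upserts.push((id, value));
        }
    }
    for &id in old.keys() {
        if !new.contains_key(&id) {
            delta.removals.push(id);
        }
    }
    delta
}

/// Replay a delta onto a state. The same function runs on both sides.
pub fn apply(state: &mut State, delta: &Delta) {
    for &(id, value) in &delta.upserts {
        state.insert(id, value);
    }
    for id in &delta.removals {
        state.remove(id);
    }
}

/// Convergence check: start from the first snapshot, fold every delta in
/// the chain, and compare the result with the last server snapshot.
pub fn converges(snapshots: &[State]) -> bool {
    if snapshots.is_empty() {
        return true;
    }
    let mut client = snapshots[0].clone();
    for pair in snapshots.windows(2) {
        apply(&mut client, &diff(&pair[0], &pair[1]));
    }
    Some(&client) == snapshots.last()
}
```

An unchanged entry never appears in `upserts`, which is where the traffic saving comes from. `converges` is the shape of test that would catch a diff and an apply that disagree.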

Cutting O(N²) down to a line. This is the answer to wall 1. A player does not need to know about all 10,000 others — only their neighbors in space and diplomacy. So each delta is clipped to a player's Area of Interest: you receive the changes relevant to you, while shared context like chat and trades still gets through. That turns a quadratic broadcast into a linear one — traffic grows with N times the size of your interest area, not with N². Without this step, 10,000 is impossible on paper. With it, the math closes.
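One common way to implement an Area of Interest filter without comparing every viewer to every update is a uniform grid. The sketch below assumes that approach, since the post does not say which spatial structure the engine uses. The cell size and all names are hypothetical, and it covers only the spatial part of interest, not diplomacy, chat, or trades.

```rust
use std::collections::HashMap;

/// Hypothetical cell size in world units.
pub const CELL: i32 = 100;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Update {
    pub entity: u32,
    pub x: i32,
    pub y: i32,
}

pub type Grid = HashMap<(i32, i32), Vec<Update>>;

/// Map a world position to its grid cell (floor division, so negative
/// coordinates land in the correct cell).
fn cell_of(x: i32, y: i32) -> (i32, i32) {
    (x.div_euclid(CELL), y.div_euclid(CELL))
}

/// Bucket one tick's updates by cell: a single pass over the updates.
pub fn index(updates: &[Update]) -> Grid {
    let mut grid: Grid = HashMap::new();
    for u in updates {
        grid.entry(cell_of(u.x, u.y)).or_default().push(*u);
    }
    grid
}

/// Updates relevant to a viewer at (x, y): the viewer's own cell plus its
/// eight neighbors. Cost depends on local density, not on total players.
pub fn visible(grid: &Grid, x: i32, y: i32) -> Vec<Update> {
    let (cx, cy) = cell_of(x, y);
    let mut out = Vec::new();
    for dx in -1..=1 {
        for dy in -1..=1 {
            if let Some(bucket) = grid.get(&(cx + dx, cy + dy)) {
                out.extend_from_slice(bucket);
            }
        }
    }
    out
}
```

Indexing is one pass over the updates and each viewer touches nine buckets, so the total work and the bytes sent grow with N times the local neighborhood size rather than with N squared.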

The moment the pretty number turned out to be a lie

Here's the part I'm most proud of, and it's a part where I was wrong.

I had a load test, and it reported a lovely 320 Mbit/s at 10k players — comfortably under budget. I almost believed it. Then I looked at where that number came from and found it rested on a hardcoded guess: "assume 32 bytes per player update." Not a measurement. An assumption someone (me) had typed in.

So I plugged in the real binary encoder and measured an actual update. It was 104 bytes, not 32. Why? Because I was shipping a player's entire profile (eleven resource fields) every time, even when only one of them changed. 104 versus 32 is 3.25× worse, which blew the projection out to about 1.04 Gbit/s at 10k, over the 1 Gbit/s budget. The honest answer, in that moment, was the uncomfortable one: "no, on the real wire format, I do not hold 10,000."
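The correction is simple proportional arithmetic, assuming the projection scales linearly with bytes per update and nothing else in the load test changed. The function name is made up for this sketch.

```rust
/// Rescale an egress projection when an assumed bytes-per-update constant
/// is replaced by a measured one. Assumes egress is linear in update size.
pub fn rescale_mbit(projected_mbit: f64, assumed_bytes: f64, measured_bytes: f64) -> f64 {
    projected_mbit * measured_bytes / assumed_bytes
}
```

With the post's numbers, `rescale_mbit(320.0, 32.0, 104.0)` gives 1040.0 Mbit/s, the 1.04 Gbit/s figure.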

That's the moment that separates an architect from someone who ships a nice demo. The easy path is to keep quoting 320. The right path is to believe the measurement, say the number's bad out loud, and fix the actual thing. The fix was a per-field delta: instead of the whole profile, send an id, a tiny bitmask of which fields changed, and only those fields' values.

player_id (4 bytes) + changed-field bitmask (2 bytes) + only the changed values

# if population, food and science changed:
4 + 2 + 4 + 8 + 8 = 26 bytes   (instead of 104)

Twenty-six bytes instead of a hundred and four. The mask is sixteen bits, one bit per field, which covers the eleven resource fields with room to spare: if the bit is set, its value follows on the wire; if not, the field didn't change and costs nothing at all.
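A per-field delta encoder along these lines might look like the sketch below. It rests on assumptions: only the widths of population (4 bytes), food (8) and science (8) come from the example above, while the other eight widths, the field order, the little-endian layout, and all names are placeholders.

```rust
/// Wire width in bytes of each field, indexed by its bit in the mask.
/// Only the first three widths come from the post; the rest are assumed.
pub const WIDTHS: [usize; 11] = [4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8];
pub const POPULATION: usize = 0;
pub const FOOD: usize = 1;
pub const SCIENCE: usize = 2;

/// Hypothetical profile: eleven resource fields held as u64 in memory.
pub type Profile = [u64; 11];

/// Layout: player_id (4 bytes) + bitmask (2 bytes) + changed values only.
/// A 4-byte field keeps just the low 32 bits of its value.
pub fn encode(player_id: u32, old: &Profile, new: &Profile) -> Vec<u8> {
    let mut mask: u16 = 0;
    let mut body = Vec::new();
    for i in 0..WIDTHS.len() {
        if old[i] != new[i] {
            mask |= 1 << i;
            body.extend_from_slice(&new[i].to_le_bytes()[..WIDTHS[i]]);
        }
    }
    let mut out = Vec::with_capacity(6 + body.len());
    out.extend_from_slice(&player_id.to_le_bytes());
    out.extend_from_slice(&mask.to_le_bytes());
    out.extend_from_slice(&body);
    out
}

/// Apply an encoded delta on top of the last known profile.
/// Returns None if the buffer is truncated or has trailing bytes.
pub fn decode(bytes: &[u8], base: &Profile) -> Option<(u32, Profile)> {
    if bytes.len() < 6 {
        return None;
    }
    let player_id = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let mask = u16::from_le_bytes([bytes[4], bytes[5]]);
    let mut profile = *base;
    let mut pos = 6;
    for i in 0..WIDTHS.len() {
        if (mask & (1 << i)) != 0 {
            let raw = bytes.get(pos..pos + WIDTHS[i])?;
            let mut buf = [0u8; 8];
            buf[..WIDTHS[i]].copy_from_slice(raw);
            profile[i] = u64::from_le_bytes(buf);
            pos += WIDTHS[i];
        }
    }
    if pos == bytes.len() {
        Some((player_id, profile))
    } else {
        None
    }
}
```

Changing population, food and science yields 4 + 2 + 4 + 8 + 8 = 26 bytes, and decoding those bytes on top of the old profile gives back the new one. That is the round-trip property a codec test would assert.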

Proof, not promises

An architecture without proof is just a confident story. So here's the proof, measured.

Underneath are the boring guarantees: round-trip tests (encode a delta to bytes, decode it back, lossless), convergence tests (the client never drifts from the server), contract tests on every boundary, and 42 end-to-end tests in a real Chromium browser over WebTransport — including the ones that actually matter for live play, like "receives server ticks over a realtime socket with no polling" and "two players in one world stay in sync in real time."

And the headline: a 10,000-bot swarm test that measures egress with the real binary codec — no hardcoded constants this time. It spins up 10,000 bot clients in one world, runs a real authoritative tick, applies each bot's visible updates through the real Area-of-Interest filter, and adds up the actual measured bytes.

26 bytes: typical per-player update, down from 104
237.6 Mbit/s: egress at 10,000 players × 5 ticks/s
24% of the 1 Gbit/s budget, leaving about 76% headroom
≤ 200 ms: time to fan a tick out to all 10k

And the number is kept honest: a gate called check:load re-verifies the report on every run, and the projection is derived from the measured value rather than an assumption. If some future change accidentally fattens the update format, the gate fails and the regression can't sneak through. The guard exists precisely because I'd already been fooled once by a number I didn't measure.
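The internals of `check:load` are not shown, so the following is only a sketch of what such a gate could compute. The 1 Gbit/s budget and 5 ticks/s come from the post. The function names and the percentage threshold are assumptions.

```rust
/// Egress budget from the post: 1 Gbit/s.
pub const BUDGET_BITS_PER_SEC: u64 = 1_000_000_000;
/// Tick rate from the post: 5 ticks per second.
pub const TICKS_PER_SEC: u64 = 5;

/// Project sustained egress from the bytes measured for one full tick.
pub fn projected_bits_per_sec(measured_bytes_per_tick: u64) -> u64 {
    measured_bytes_per_tick * 8 * TICKS_PER_SEC
}

/// Fail when the projection exceeds `max_percent` of the budget.
/// The threshold is a hypothetical knob, not a value from the post.
pub fn check_load(measured_bytes_per_tick: u64, max_percent: u64) -> Result<u64, String> {
    let bits = projected_bits_per_sec(measured_bytes_per_tick);
    let limit = BUDGET_BITS_PER_SEC / 100 * max_percent;
    if bits <= limit {
        Ok(bits)
    } else {
        Err(format!(
            "egress {} bit/s exceeds {}% of budget ({} bit/s)",
            bits, max_percent, limit
        ))
    }
}
```

Working backwards from the headline figure, 237.6 Mbit/s at 5 ticks/s corresponds to 5,940,000 bytes per tick across all players, which passes a 30% threshold. A format four times fatter (23,760,000 bytes per tick, roughly the old 104-byte updates) would project to about 950 Mbit/s and fail.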

The honest boundary

Maturity isn't just getting a good number — it's being clear about what the number does and doesn't cover. So, plainly:

What's proven. By compute and latency, fanning a tick out to 10,000 bots stays under 200 ms with lots of room. By traffic, 237.6 Mbit/s at 10k, or 24% of a 1 Gbit/s budget, and that figure is derived from bytes measured through the real codec, not from a hopeful guess.

What's not proven yet. This is a projection from one measured tick at 10k, with bots inside one process — not a live cluster of 10,000 real QUIC sockets hammered for hours. Sustained CPU and memory under a long multi-minute stream, and the per-player update channel for the thick browser client, are the next layer of work. I'd rather say that than oversell it. The thing that matters is that the biggest risk — egress, the one I'd flagged myself as over budget on the real format — is closed and measured.

What transfers, even if you never build a game

Strip away the spaceships and the genetics, and what's left is a set of moves I'd use on almost any system that has to serve a lot of people at once:

Name the real bottleneck before you optimize anything. Here it was egress, not CPU. Everything followed from that one call. The most expensive mistakes are made optimizing the wrong resource beautifully.
Separate reads from writes. A single writer for consistency, plus lock-free read projections, killed the contention without giving up correctness. This one's nearly universal.
Send diffs, not whole states — and if you can, share one codebase across the boundary so the two sides physically can't disagree.
Find the O(N²) and cut it to linear. Almost every "it falls over at scale" story has a hidden quadratic in it. Mine was broadcast; Area of Interest was the knife.
Never trust a pretty number you didn't measure. My 320 Mbit/s was a typed-in assumption that was 3.25× wrong. The whole result hinged on catching that and being willing to say it.
Lock the win behind a gate so a future regression trips an alarm instead of shipping silently.

The bottom line

Ten thousand players in one live world is the kind of target that sounds like bravado until you break it into four walls and answer them one at a time, measuring as you go. The engine does it on 24% of its traffic budget, and I can show you the test that proves it rather than ask you to take my word.

The architecture isn't a clever trick — it's a chain of decisions, each one answering a specific wall, each one backed by a number. That's the whole discipline: name the real constraint, build to it in measured layers, and trust the measurement over the story you wanted to tell. Helix Empire goes live soon at helixempire.com — and the full engineering writeup lives in the case study.
