Express course · No. 09
Scaling isn't making everything faster — it's finding the one part that buckles under load and widening it, then the next. Most systems need far less of this than their builders fear; the skill is knowing what to reach for, and when.
Essence only · One picture per idea · Measure over guess
Scaling isn't a single switch you flip — it's the ongoing job of handling more load without falling over. It starts with one idea: find the part that breaks first.
Scaling means handling more without breaking
A small café runs fine until a tour bus pulls up — suddenly the one barista, the one till, the few tables can't keep up. Scaling is preparing for the bus.
To scale is to handle growth — more users, more requests, more data — while staying fast and up. It isn't about peak cleverness; it's about the system not buckling when load multiplies. And almost every scaling problem reduces to one thing: something becomes the part that can't keep up.
Up or out: vertical vs horizontal
To carry more, you can buy a bigger truck — only so big — or buy more trucks: nearly no limit, but now you need a dispatcher.
Vertical scaling is a bigger machine — more CPU, more RAM. Easy, but it has a hard ceiling and stays a single point of failure. Horizontal scaling is more machines working together — nearly unlimited, but it needs coordination (load balancing, shared state). Vertical buys you time; horizontal is where real scale lives.
Everything is a bottleneck hunt
A highway is only as fast as its narrowest stretch — widening every other lane does nothing until you fix the chokepoint.
A system is only as fast as its slowest part. Throwing resources at the wrong place is wasted money; the work is to find the actual bottleneck — the database, a slow service, the network — widen it, then find the next one. Scaling is a sequence of bottlenecks, not a single fix.
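A minimal sketch of the bottleneck hunt, assuming made-up stage names and capacities (requests per second) chosen purely for illustration. It shows that doubling a stage that isn't the limit changes nothing, while widening the real bottleneck simply reveals the next one.

```python
# Hypothetical capacities in requests per second; the numbers are illustrative only.
stages = {"load_balancer": 5000, "app_servers": 1200, "database": 300}

def throughput(capacities):
    """A pipeline can go no faster than its slowest stage."""
    return min(capacities.values())

def bottleneck(capacities):
    """Name of the stage that currently limits the whole pipeline."""
    return min(capacities, key=capacities.get)

before = throughput(stages)                            # limited by the database
wasted = throughput({**stages, "app_servers": 2400})   # doubling the wrong stage
widened = {**stages, "database": 1500}                 # widening the real bottleneck
after = throughput(widened)
next_limit = bottleneck(widened)                       # the next bottleneck in line

print(before, wasted, after, next_limit)
```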
Most systems never need much of this
You don't build a six-lane motorway to a cabin in the woods. One good road carries everyone who'll ever visit.
A single decent server plus a database serves a startling number of users. Premature scaling — sharding, microservices, elaborate caching before you have the load — is cost and complexity you don't need. Measure first; scale the thing that's actually hurting, when it actually hurts.
Scaling isn't making everything faster. It's finding the bottleneck, widening it, and repeating.
The first real moves up the scaling ladder: outgrow one big box, then make the app able to run as many identical boxes as you like.
Vertical first: the bigger box
When the kitchen's overwhelmed, the first fix is a bigger stove and more counter — not a second restaurant.
The simplest scaling is a bigger machine. It usually needs no code changes and buys real headroom fast, often enough to last a long time. But it has a hard ceiling (you can't buy an infinite server), and that one box stays a single point of failure. Use it to buy time, not as the destination.
Horizontal: many boxes, one job
One checkout becomes ten — now ten customers are served at once, and if one register jams, the other nine carry on.
Real scale means running many copies of your app and splitting work across them. It's nearly unbounded and survives the loss of any one machine. The catch is coordination: something must spread the requests (next chapter), and the copies must not each hold their own private state — which is the next, crucial idea.
Stateless servers: any box can serve any request
A call centre where any agent can pick up any call, because all the customer's details live in the shared system — not in one agent's head.
For horizontal scaling to work, the app servers must be stateless: they keep nothing about a user between requests in their own memory. Session data, uploads, progress — all of it lives somewhere shared (a database, a cache, a token the client carries). Then any request can hit any server, and adding or losing a box changes nothing. Statelessness is what makes "just add more servers" actually work.
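A small sketch of what "stateless" means in practice. The plain dictionary stands in for a shared store such as a cache or database, and the `AppServer` class and session id are hypothetical names for illustration. Two different servers handle the same user's requests and still see one cart.

```python
# Shared store standing in for a cache or database (an assumption of this sketch).
shared_sessions = {}

class AppServer:
    """Holds no per-user data itself; all session state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        cart = self.store.setdefault(session_id, [])
        cart.append(item)
        return list(cart)

server_a = AppServer("a", shared_sessions)
server_b = AppServer("b", shared_sessions)

server_a.add_to_cart("session-42", "coffee")
cart = server_b.add_to_cart("session-42", "bagel")  # a different box, same session

# Losing server_a changes nothing: a brand-new box sees the same state.
server_c = AppServer("c", shared_sessions)
cart_after_loss = server_c.add_to_cart("session-42", "juice")
```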
Vertical buys time; horizontal buys scale. And horizontal only works if your servers remember nothing on their own.
Once you have many servers, something has to stand in front and decide who handles each request. That's the load balancer — the traffic cop of a scaled system.
The load balancer: one front door, many servers
A queue manager at a bank directing each customer to whichever teller is free — so no one teller is swamped while others sit idle.
A load balancer (Nginx, HAProxy, a cloud balancer) is the single entry point that spreads incoming requests across your pool of servers by simple rules — round-robin, or send to the least-busy one. Clients talk to it, not to individual servers, so you can add or remove machines behind it invisibly. It's what turns a pile of servers into one system.
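The two rules named above can be sketched in a few lines. This is a toy model, not how Nginx or HAProxy are implemented; the class names and server names are hypothetical.

```python
import itertools

class RoundRobin:
    """Hand requests to each server in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastBusy:
    """Send each request to the server with the fewest requests in flight."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def done(self, server):
        self.active[server] -= 1

rr = RoundRobin(["app1", "app2", "app3"])
rr_order = [rr.pick() for _ in range(5)]

lb = LeastBusy(["app1", "app2"])
first = lb.pick()    # both idle, so the first listed server wins the tie
second = lb.pick()   # app1 is now busy, so app2 gets this one
lb.done(first)       # app1 finishes its request
third = lb.pick()    # app1 is idle again and gets the next request
```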
Health checks: stop sending to the dead
The queue manager notices a teller has stepped away and simply stops sending people to that window — no one waits at an empty desk.
A load balancer probes each server at regular intervals and routes around the ones that fail the check. A crashed or overloaded box is pulled from rotation automatically, and traffic flows to the rest. This is how a scaled system survives a single machine dying with most users never noticing.
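A toy sketch of health-checked routing. The `probe` callable is a stand-in for a real check such as an HTTP request to a health endpoint, and all names here are hypothetical. A server that fails the probe drops out of rotation and returns once it passes again.

```python
class HealthCheckedPool:
    """Round-robin over only the servers that passed the latest health check."""

    def __init__(self, servers, probe):
        self.servers = list(servers)
        self.probe = probe            # callable: server name -> True if healthy
        self.healthy = list(servers)
        self._next = 0

    def run_health_checks(self):
        self.healthy = [s for s in self.servers if self.probe(s)]

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers")
        server = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return server

down = set()  # servers that are currently failing, for the simulation
pool = HealthCheckedPool(["app1", "app2", "app3"], probe=lambda s: s not in down)

down.add("app2")             # app2 crashes
pool.run_health_checks()     # the next check notices
routed = [pool.pick() for _ in range(4)]

down.discard("app2")         # app2 recovers
pool.run_health_checks()
recovered = "app2" in pool.healthy
```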
Free wins: zero-downtime deploys and failover
You can renovate one checkout lane at a time while the others keep serving — the shop never closes.
Once traffic flows through a balancer over interchangeable servers, powerful things come almost for free: deploy with no downtime (update servers a few at a time), fail over when one dies, and roll back fast. The balancer plus stateless servers is the backbone that makes a system both scalable and resilient.
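A rough sketch of "update servers a few at a time". The version strings and the dictionary of servers are hypothetical, and the assignment stands in for a real install step. The function also reports the lowest number of servers left serving during the rollout, which is the figure to keep above what your traffic needs.

```python
def rolling_deploy(versions, new_version, batch_size):
    """Update servers one batch at a time; return the lowest count left serving."""
    servers = list(versions)
    in_rotation = set(servers)
    lowest_serving = len(servers)
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        in_rotation.difference_update(batch)   # drain: the balancer stops routing here
        lowest_serving = min(lowest_serving, len(in_rotation))
        for server in batch:
            versions[server] = new_version     # stand-in for the real install step
        in_rotation.update(batch)              # back into rotation once updated
    return lowest_serving

fleet = {"app1": "v1", "app2": "v1", "app3": "v1", "app4": "v1"}
lowest = rolling_deploy(fleet, "v2", batch_size=1)
```

With a batch size of one, three of the four servers are always serving; a larger batch finishes sooner but leaves less capacity in rotation while it runs.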
A load balancer turns many servers into one address — and a dying machine into a non-event.
The fastest work is the work you don't do. Before scaling the machines that do the work, cut how much work there is — with caches and a CDN.
Caching: don't compute the same answer twice
A chef who preps the popular sauce once in the morning, instead of making it from scratch for every single order.
A cache stores the result of expensive work — a database query, a computed page, an API call — so the next request gets it instantly instead of redoing it. A fast in-memory store (Redis, Memcached) in front of the database can absorb the bulk of reads. Often a cache, not more servers, is what your overloaded database actually needs.
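A minimal in-process sketch of the pattern, assuming a dictionary in place of Redis or Memcached and a hypothetical `slow_query` function in place of a real database call. The second request is answered from the cache, so the expensive work runs only once until the entry expires.

```python
import time

class TTLCache:
    """Return a stored result while it is fresh; otherwise compute and store it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}   # key -> (value, expires_at)

    def get_or_compute(self, key, compute):
        entry = self._data.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]                      # hit: skip the expensive work
        value = compute()                        # miss or expired: do the work once
        self._data[key] = (value, now + self.ttl)
        return value

db_calls = []  # records each time the "database" is actually queried

def slow_query():
    db_calls.append("query")
    return {"id": 7, "name": "Ada"}

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("user:7", slow_query)
second = cache.get_or_compute("user:7", slow_query)
```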
The CDN: serve users from nearby
A global chain that stocks the same goods in local warehouses everywhere — so customers get them from down the road, not shipped across an ocean.
A CDN keeps copies of your static content — images, scripts, video, increasingly whole pages — in data centres around the world, serving each user from the closest one. It slashes latency and takes huge load off your origin. For anything global and static, the CDN does the heavy lifting before your servers ever see the request.
The hard part is invalidation
The calendar on the fridge is faster than checking your phone — until someone changes the plan on the phone, and now the fridge is lying.
A cache is a second copy of the truth, and a copy can go stale. The genuinely hard question isn't adding a cache — it's knowing when to throw it away so users don't see old data. Cache what's read far more than it's written, set sane expiries, and always know what happens when the cache is wrong. (The same lesson as a database cache.)
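The fridge-calendar problem in a few lines. Two dictionaries stand in for the database and the cache, and the function names are hypothetical. A write that skips invalidation leaves readers with the old value; a write that drops the cached copy does not.

```python
database = {"plan": "dinner at 7"}
cache = {}

def read(key):
    if key not in cache:
        cache[key] = database[key]   # miss: fill from the source of truth
    return cache[key]

def write_without_invalidation(key, value):
    database[key] = value            # the cache still holds the old copy

def write_with_invalidation(key, value):
    database[key] = value
    cache.pop(key, None)             # drop the stale copy; the next read refills it

read("plan")                                        # warms the cache
write_without_invalidation("plan", "dinner at 8")
stale = read("plan")                                # still the old plan
write_with_invalidation("plan", "dinner at 9")
fresh = read("plan")                                # the current plan
```

Invalidating on write covers changes that go through this code path; an expiry on each entry is the usual safety net for changes that don't.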
The cheapest request is the one you never serve. Cache the hot stuff, ship the static from the edge, and sweat the invalidation.
Sooner or later the database becomes the bottleneck — it's the one part that's genuinely hard to scale, because it holds the truth everyone shares.
The database is usually the wall
Ten checkouts all reaching into the same single stockroom — the cashiers scale, but the one stockroom door becomes the jam.
You can run a hundred stateless app servers, but they usually share one database — and that becomes the limit. Unlike app servers, you can't just clone it, because every copy has to agree on the data. So scaling the database is the hardest, most careful part of scaling a system — and the one to put off with caching for as long as you can.
Read replicas: copy for reads
Handing out photocopies of the reference book so many people can read at once, while the single master copy is the only one anyone writes in.
Most apps read far more than they write. Read replicas are copies of the database that serve read queries, spreading that load across machines while writes still go to one primary. It's the first and easiest database scaling move — but copies lag slightly behind the primary, so a just-written value may not appear instantly on a replica.
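A toy model of a primary with read replicas, using dictionaries in place of real databases and an explicit `replicate()` call in place of real replication (both assumptions of this sketch). It shows the lag described above: a value just written to the primary is missing from the replicas until replication catches up.

```python
import itertools

class ReplicatedDB:
    """Writes go to one primary; reads are spread across replicas that lag behind."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self._pending = []   # writes not yet applied to the replicas (the lag)

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def read(self, key):
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)

    def read_from_primary(self, key):
        return self.primary.get(key)

    def replicate(self):
        """Apply pending writes to every replica, in order."""
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()

db = ReplicatedDB(replica_count=2)
db.write("balance", 100)
lagging = db.read("balance")              # replica has not seen the write yet
fresh = db.read_from_primary("balance")   # the primary always has it
db.replicate()
caught_up = db.read("balance")
```

A common way to live with the lag is to send reads that must see the user's own latest write to the primary, and everything else to the replicas.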
Sharding: split the writes
One overflowing ledger becomes several — A–M in one book, N–Z in another — so two clerks can write at once without fighting over a single page.
When even writes outgrow a single machine, you shard: split the data across several databases by some key, so each handles a slice. It unlocks huge scale but is genuinely hard — cross-shard queries get painful, and the shard key is a near-permanent choice. This is the deep end; reach it last, when replicas and caching truly aren't enough.
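A sketch of hash-based sharding, one common way to pick a shard (splitting by ranges such as A to M and N to Z is another). The shard count, key format, and function names are hypothetical. It also shows the two costs named above: a query across all the data has to visit every shard, and changing the number of shards moves most keys.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a key to a shard with a stable hash, so every server agrees on the answer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

SHARD_COUNT = 4
shards = [dict() for _ in range(SHARD_COUNT)]

def put(user_id, record):
    shards[shard_for(user_id, SHARD_COUNT)][user_id] = record

def get(user_id):
    return shards[shard_for(user_id, SHARD_COUNT)].get(user_id)

def count_all():
    """A cross-shard query: it has to ask every shard and combine the answers."""
    return sum(len(shard) for shard in shards)

for i in range(100):
    put(f"user-{i}", {"n": i})

# Going from 4 to 5 shards with this scheme reassigns many existing keys,
# which is one reason the sharding scheme is so hard to change later.
moved = sum(
    1 for i in range(100)
    if shard_for(f"user-{i}", 4) != shard_for(f"user-{i}", 5)
)
```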
App servers clone easily; the database doesn't. Cache and replicate for a long time before you ever shard.
Not all work has to happen while the user waits. Pushing slow work off the request, and putting buffers between parts, is how a system stays fast and absorbs spikes.
Move slow work off the request path
At a restaurant they take your order and you sit down — they don't make you stand at the counter until the meal is cooked.
When a request triggers something slow — sending email, generating a report, processing an upload — don't make the user wait for it. Hand it to a background queue and return immediately; a worker does the slow part later. The page stays snappy, and the heavy lifting happens off to the side. (The messaging idea from the protocols and architecture courses.)
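A minimal in-process sketch using Python's standard `queue` and `threading` modules. A real system would typically use a separate queue service and worker processes; the `handle_signup` and `send_email` names are hypothetical. The handler returns as soon as the job is queued, and the worker does the slow part afterwards.

```python
import queue
import threading

jobs = queue.Queue()
sent = []

def send_email(address):
    sent.append(address)          # stand-in for the slow work

def worker():
    while True:
        address = jobs.get()
        if address is None:       # sentinel: time to shut down
            jobs.task_done()
            break
        send_email(address)
        jobs.task_done()

def handle_signup(address):
    jobs.put(address)             # hand the slow part to the queue
    return "202 Accepted"         # respond without waiting for the email

thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_signup("ada@example.com")

jobs.join()       # only so this demo can inspect the result; a handler would not wait
jobs.put(None)
thread.join()
```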
Queues absorb spikes
A reservoir between a flooding river and the town — the surge fills the reservoir instead of drowning the streets, and drains out at a steady rate.
A queue is also a buffer. When traffic suddenly spikes — a launch, a viral moment — work piles up in the queue and the workers drain it at their own pace, instead of a flood crashing the system. Without that buffer, a surge that arrives faster than you can process it takes everything down. The queue turns a deadly spike into a manageable backlog.
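A small simulation of the reservoir effect, with invented arrival numbers and a fixed worker capacity per tick. The backlog grows during the spike and then drains at the workers' steady rate.

```python
def simulate(arrivals_per_tick, capacity_per_tick):
    """Return the backlog left in the queue at the end of each tick."""
    backlog = 0
    history = []
    for arrived in arrivals_per_tick:
        backlog += arrived
        backlog -= min(backlog, capacity_per_tick)   # workers drain at a steady rate
        history.append(backlog)
    return history

# Normal traffic is 5 jobs per tick; ticks 2 and 3 are the spike.
arrivals = [5, 5, 40, 40, 5, 5, 5, 5, 5, 5]
history = simulate(arrivals, capacity_per_tick=20)
peak = max(history)

print(history)  # [0, 0, 20, 40, 25, 10, 0, 0, 0, 0]
```

The workers never handle more than 20 jobs per tick, yet nothing is dropped: the cost of the spike is a few ticks of delay rather than an outage.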
Decouple so one slow part doesn't sink the rest
Watertight compartments in a ship — if one floods, the bulkheads keep the others dry and the ship afloat.
When services talk through queues and events instead of direct, blocking calls, a slow or failed component doesn't freeze everyone waiting on it. The order still gets taken even if the email service is down; it just catches up later. Decoupling is what lets a big system degrade gracefully under stress instead of failing all at once.
Make the user wait only for what they need now. Everything else goes behind a queue — which also saves you when the spike hits.
Scaling well is mostly restraint and order: measure the real bottleneck, climb the cheap rungs first, and prepare for the failures that more machines bring.
Measure before you scale
A doctor runs tests before operating — cutting into the wrong organ because you guessed helps no one.
Before adding anything, find the real bottleneck with metrics and profiling — is it the database, a slow query, the network, CPU? Engineers routinely optimise the wrong thing because it merely felt slow. The data tells you where the system actually hurts; scale that, and nothing else. (This is where observability earns its keep.)
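A bare-bones sketch of measuring instead of guessing, using only `time.perf_counter`. The three stages are invented stand-ins, with a short sleep playing the part of a slow query. In a real system this job belongs to your metrics and profiling tools; the point is only that the slowest stage is read off the numbers.

```python
import time

def timed(label, func, timings):
    """Run func and add its elapsed wall-clock time to timings[label]."""
    start = time.perf_counter()
    result = func()
    timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)
    return result

def handle_request(timings):
    timed("parse", lambda: sum(range(1_000)), timings)
    timed("db_query", lambda: time.sleep(0.05), timings)   # stand-in for a slow query
    timed("render", lambda: sorted(range(10_000), reverse=True), timings)

timings = {}
for _ in range(3):
    handle_request(timings)

slowest = max(timings, key=timings.get)
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:10s} {seconds * 1000:8.1f} ms")
```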
Climb the cheap ladder in order
You add insulation and seal the windows before you buy a second furnace — the cheap fixes first.
There's an order, cheapest and easiest first: a bigger box, then caching and a CDN, then stateless servers behind a load balancer scaling horizontally, then read replicas, then queues for slow work — and only at the far end, sharding. Each rung is more work and more complexity. Climb only as far as your load forces you.
More machines means more failure
One light bulb rarely fails on a given night; in a stadium of ten thousand, several are always out. At scale, something is always broken.
As you add machines, the chance that something is failing at any given moment approaches certainty. So scaling and reliability are the same project: redundancy (no single point of failure), health checks and failover, and graceful degradation so a broken part sheds quality instead of killing the system. Design for failure, because at scale it's constant.
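The arithmetic behind that claim, assuming each machine fails independently with the same probability at a given moment. Both the independence and the 0.1% figure are assumptions made for illustration.

```python
def p_any_failed(machines, p_single):
    """Chance that at least one of `machines` is failing, if failures are independent."""
    return 1 - (1 - p_single) ** machines

P_SINGLE = 0.001   # hypothetical: each machine is failing 0.1% of the time

for n in (1, 100, 1_000, 10_000):
    print(f"{n:6d} machines: {p_any_failed(n, P_SINGLE):.4f}")
```

Under these assumptions one machine is almost always fine, a thousand have something broken about 63% of the time, and ten thousand almost always do.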
- Have I measured where the real bottleneck is, not guessed?
- Is there a cheaper rung (a bigger box, a cache, a CDN) before this one?
- Are my app servers stateless, so I can add more?
- Is the database load reads (replicas, cache) or writes (the hard road)?
- Can slow work move behind a queue?
- Do I actually have this load, or am I building for a future that may not come?

- Sharding a database a single box could easily hold.
- Microservices and queues for an app with a hundred users.
- Scaling the part that wasn't the bottleneck, because it felt slow.
- A cache layer so tangled nobody's sure when it's stale.
- Building for millions of users you don't have, and a launch that hasn't happened.

- You scaled the measured bottleneck, and the next one is now visible.
- Servers are stateless; adding or losing one is a non-event.
- Reads are cached and replicated; the database breathes.
- Slow work runs behind queues, and a spike becomes a backlog, not a crash.
- There's no single point of failure: any one machine can die quietly.
- You climbed only as far up the ladder as the load required.
Scale the bottleneck you measured, with the cheapest tool that fixes it, and assume that at scale something is always broken. Everything else is premature.