Express course · No. 06
Languages and frameworks come and go. The data you store, and the shape you store it in, can outlive all of them — so choosing a database is choosing how your system remembers, and how hard it is to change its mind.
Essence only · One picture per store · Examples over benchmarks
A database is your system's long-term memory. Code gets rewritten; the data, and the shape you chose for it, tends to outlive everything around it.
The data outlives the code
A house gets repainted, refurnished, even re-roofed over decades — but the foundation poured on day one stays, because replacing it means lifting the whole house.
Languages change, frameworks get swapped, the UI is redrawn ten times. The data and its structure usually stay, because changing them on a live system full of real records is slow and risky. So the database is the first thing to think hard about, and the last thing to rush.
A database answers two questions
Choosing where to keep things asks two things: what shape are the things — books on shelves, or odd tools on pegs? — and how safe must they be — a vault, or a handy drawer?
Every store is defined by how it shapes data (rigid tables, flexible documents, a graph of links) and what it guarantees (strict correctness, or speed and availability with looser promises). Almost every database choice is a different answer to those two — and the rest of this course is the menu.
Pick by how you read and write, not by hype
A filing cabinet, a card catalogue, and a conveyor belt are all "storage" — but you choose by how you'll fetch things, not by which looks newest.
There is no best database. A store that's perfect for fast key lookups is wrong for deep relationship queries; one built for huge writes is wrong for strict bank transactions. The right question is your access pattern — what you read, what you write, how often, and how it must behave when things go wrong.
Code is easy to change. Data is not. Choose the store like a decision you'll live with for years — because you will.
For most systems the default is the relational database — and it earns that default. Start here, and leave only with a reason.
Tables, rows, and relations
A spreadsheet done right: each sheet is one kind of thing (customers, orders), each row one record, and orders point back to the customer they belong to.
Relational databases (Postgres, MySQL) store data in tables with a fixed schema, and link tables by reference — an order row holds the id of its customer. The schema is a promise about shape: every order has the columns it should, checked on the way in. Structure up front, in exchange for sanity later.
ACID: all-or-nothing correctness
A money transfer is two writes — minus here, plus there — but one intent. Either both happen or neither does; a half-done transfer is a disaster.
Relational databases give transactions with ACID guarantees: a group of changes either fully commits or fully rolls back, isolated from others, and survives a crash once confirmed. This is why banks, orders, and inventory live in SQL — when correctness matters more than raw speed, you want this promise.
SQL: ask for what you want, not how to get it
You tell a librarian "every novel by French authors published after 1950, sorted by year" — and they work out how to find them. You describe the result, not the search.
SQL is a declarative query language: you state the answer you want and the database plans how to fetch it, including joining tables together. SELECT ... JOIN ... WHERE ... can answer questions you never anticipated at design time — one of the relational model's quiet superpowers.
Why it's the right default
A good general-purpose tool: not the fastest at any single job, but the one you reach for until a job specifically demands otherwise.
Structured data, real relationships, ad-hoc queries, strong guarantees — the relational database does all of it well, and modern Postgres also handles JSON, full-text search, and even vectors. Begin every project here; reach for something else only when a concrete pain — scale, shape, or access pattern — pushes you out.
Start with Postgres. The question isn't "why SQL" — it's whether you have a real reason not to.
Sometimes the rigid table is the wrong shape. "NoSQL" is really a family of stores, each bending the rules in a different direction for a different job.
Document stores: a JSON blob per thing
A folder of filled-in forms, where each form can have slightly different fields — no central template forcing every one to match.
Document databases (MongoDB) store records as flexible JSON-like documents, nested and schema-light, so two records in one collection can differ. Great when shapes vary or evolve fast, and a whole thing — a product, a profile — lives as one document you fetch in a single read. The cost: fewer guarantees and harder cross-document queries than SQL.
Key-value stores: a giant hash map
A coat check — you hand over a ticket number, you get back exactly your coat. No searching, no questions, instant.
Key-value stores (Redis, DynamoDB) are the simplest shape: a key in, a value out, blazingly fast. Perfect for caching, sessions, rate-limit counters, and anything you fetch by a known id. The trade: you can only look things up by key — there's no rich querying over the values themselves.
Wide-column stores: built for enormous writes
A vast warehouse with endless identical aisles, arranged so a thousand forklifts can stock shelves at once without bumping into each other.
Wide-column stores (Cassandra) spread data across many machines for massive write throughput and scale, trading rich queries and strict consistency for the ability to absorb a firehose. Used for event logs, telemetry, and feeds at the scale where a single SQL box would buckle. Powerful — and more operational work to run well.
Graph databases: when the links are the point
A detective's pinboard — photos connected by string. The value isn't in any one photo; it's in the web of who's tied to whom.
Graph databases (Neo4j) model data as nodes and the edges between them, making relationship queries — "friends of friends who like this," "the path between these two accounts" — natural and fast, where SQL would need painful many-way joins. The home of social graphs, recommendations, and fraud detection. Reach for it when the connections matter more than the records.
NoSQL isn't "newer SQL." Each store drops a guarantee to win something specific — know which one you're trading away.
Beyond the general stores sit databases built for one job, where a purpose-built engine beats a general one by a mile.
Time-series: data stamped by the clock
A heart-rate monitor's tape — an endless strip of readings, each tied to a moment, where you almost always ask "what happened between these two times?"
Time-series databases (InfluxDB, TimescaleDB) are tuned for data that arrives in time order and is queried by time — metrics, sensor readings, prices. They ingest huge streams and answer "average per minute over the last week" fast, with built-in expiry of old data. When everything you store has a timestamp and time is the main axis, this is the fit.
Search engines: full-text and relevance
A search box that forgives typos, ranks the best matches first, and finds "running" when you typed "run" — like a librarian who knows what you meant.
Search engines (Elasticsearch, OpenSearch) index text for fast full-text search with ranking, fuzzy matching, and filters — what a LIKE '%word%' in SQL does slowly and badly. They usually run alongside your main database, kept in sync, powering the search bar and log exploration. Not your source of truth — a fast lens onto it.
Vector databases: search by meaning
Instead of matching exact words, you ask "find me things that feel like this one" — and it returns neighbours by similarity, not by keyword.
Vector databases (pgvector, Pinecone) store embeddings — numeric fingerprints of meaning — and find the nearest ones, so you can search by similarity rather than exact text. This is the engine under semantic search and RAG, where an AI app retrieves the most relevant chunks to answer from. The new essential store of the AI era.
Object storage: where the big files go
A self-storage unit — not for index cards, but for the furniture: bulky things you keep by label and haul out whole.
Object storage (S3 and its kin) isn't a database, but it's where the large blobs live — images, video, backups, documents — addressed by a key and served cheaply at scale. The pattern is almost universal: keep the file in object storage, keep a row pointing to it in your database. Records in the DB, payloads in the bucket.
Don't bend your main database into a search engine or a file server. Put the specialised job in the store built for it.
A database's deepest choice isn't its shape — it's what it promises when things go wrong. A little alphabet soup names that trade.
ACID: pessimistic, correct, careful
A bank teller who double-checks every entry and won't move on until the books balance — slower, but never wrong.
ACID (Atomicity, Consistency, Isolation, Durability) is the strict promise of relational databases: every transaction leaves the data correct and complete, or doesn't happen at all. You pay in some speed, and in harder scaling across machines. For money, orders, and anything where a wrong number is unacceptable, this is the trade you want.
BASE: optimistic, available, eventually right
A busy shop that lets everyone keep buying even while the price tags are being updated — the numbers catch up in a moment, and that's fine.
Many NoSQL systems take the opposite stance: BASE (Basically Available, Soft-state, Eventually consistent). They stay fast and available and accept that copies of the data may briefly disagree before converging. Perfect when availability beats instant precision — a like count, a feed, a view counter — and a second of staleness costs nothing.
CAP: pick two when the network breaks
A team split by a downed phone line: each half can keep working alone (but they'll diverge), or down tools until reconnected (but you stop). You can't have both.
The CAP theorem says that when a network partition splits your database's machines, you must choose Consistency (refuse writes, stay correct) or Availability (keep serving, risk disagreement) — you can't have both during the split. Every distributed database makes this call. Knowing which your store picks tells you how it behaves on its worst day.
ACID for money, BASE for likes. The guarantee you need is set by the cost of being briefly wrong.
A handful of techniques turn a database from a single fragile box into something fast and survivable. You'll meet all of them.
Indexes: the book's index, for data
To find a word in a 900-page book, you don't read every page — you flip to the index at the back and jump straight there.
An index is a side structure that turns a full-table scan into a direct lookup — the single biggest lever on read speed. The catch: each index costs storage and slows writes (every insert has to update it too). So you index the columns you actually filter and sort by — not every column, just in case.
Replication: keep copies, survive failure
A document kept in three safes in three buildings — if one burns down, the others still hold it, and three people can read at once.
Replication keeps copies of the data on several machines: if the leader dies, a follower takes over, and reads can spread across replicas. It buys reliability and read-scaling. The catch is replication lag — a follower can be a moment behind, so a value you just wrote might not show up on a replica yet.
Sharding: split when one box isn't enough
One overflowing filing cabinet becomes A–M in one cabinet and N–Z in another. More room — but now you must know which cabinet a name lives in.
When the data or write load outgrows a single machine, sharding splits the data across many — users A–M here, N–Z there. It unlocks near-unlimited scale, but the shard key is a hard, near-permanent choice: pick badly and you get hotspots, or queries that have to hit every shard. Replication copies the same data; sharding splits different data.
Normalise, then denormalise on purpose
One master address list everyone refers to (no contradictions) versus stapling a copy of the address onto every order (faster to read, but now there are copies to keep honest).
Normalisation keeps each fact in exactly one place — clean, no contradictions, the relational default. Denormalisation deliberately duplicates data so reads are faster and need fewer joins. Both are valid; the rule is to do it consciously, and to always know which copy is the source of truth and which is just a fast duplicate. (Same lesson as a cache.)
Add an index before a server, a replica before a shard. The cheapest scaling is the kind you don't have to operate.
You don't pick one database and force everything into it. Real systems use a few, each for the job it's best at.
Polyglot persistence: several stores, on purpose
A kitchen has a fridge, a freezer, and a pantry — not one box forced to do all three. Each keeps what suits it.
Mature systems are usually polyglot: Postgres as the source of truth, Redis for caching and sessions, a search index for the search bar, maybe a vector store for AI features and object storage for files. Each store does what it's good at. The discipline is keeping clear which one holds the truth, with the others as fast copies kept in sync.
Start with Postgres; grow out of it deliberately
A Swiss Army knife handles a startling amount before you ever need the full toolbox.
Modern Postgres is astonishingly capable — relational core, JSON documents, full-text search, vectors, even a workable job queue. Most products should start with just Postgres and add a specialised store only when a real, measured pain demands it. Every extra database is another thing to run, back up, and keep consistent. The cheapest store is the one you didn't add.
- What shape is the data — tables, documents, key-value, a graph? - How will I query it — by id, by relationships, by text, by similarity, by time? - What guarantees do I need — strict ACID, or is eventually consistent fine? - What's the read/write balance and the rough scale? - Could Postgres already do this before I add another store? - Which store holds the source of truth, and which are just copies?
- You're running
LIKE '%...%'for search instead of a search engine. - A key-value store where you constantly need to query by the values. - Five databases for a product two people maintain. - You picked MongoDB to avoid schemas, then rebuilt validation by hand anyway. - No clear source of truth — two stores both claim to be authoritative.
- The query you run most is the one the store is fastest at. - The guarantee matches the stakes — strict for money, relaxed for likes. - You can name the source of truth for every important fact. - You added each store to solve a real, felt pain, not a hypothetical. - You could explain the data model to a new engineer in a few minutes.
The best database decision is usually "Postgres, for now" — plus the wisdom to know exactly when "for now" has ended.