Express course · No. 19
When your code runs on your laptop you can pause it and poke around. When it runs in production serving real users, you can't — you only know what's happening from what the system tells you. Observability is the craft of making a system explain itself, through logs, metrics, and traces, so that when something breaks at 3am you can ask questions and get answers instead of guessing.
Essence only · One picture per idea · Learn the words
The whole field exists because of one hard fact: you cannot debug a live production system the way you debug code on your machine. So you have to make the system tell you what it's doing.
You can't attach a debugger to production
A pilot can't climb out and inspect the engine mid-flight — they fly entirely by the instruments on the panel, trusting the dials to tell them what's happening.
On your laptop you pause the code, inspect variables, step through. In production — real servers, real users, right now — you can't stop it to look. You can only learn what's happening from the signals the system emits as it runs. This is the core shift: in production you're flying by instruments, and observability is about having the right instruments on the panel before you need them.
Monitoring answers known questions; observability answers new ones
A car dashboard shows the few things you knew to watch — speed, fuel, temperature. But when a strange new noise starts, you need to be able to investigate something nobody put a dial for.
Monitoring is watching for problems you anticipated — dashboards and alerts for known failure modes. Observability is the deeper property: can you answer questions you didn't foresee, just from the data the system produces? Real outages are usually novel — "why are users in Brazil seeing slow checkouts since 2pm?" — and observability is whether your system emits enough to let you chase a question you never planned for.
A black box is a system you can only guess about
A vending machine that took someone's money and gave nothing, with no display and no receipt — all you can do is shrug and guess, because it tells you nothing.
A system that emits little is a black box: when it misbehaves, you're reduced to guessing and restarting. The cost shows up at the worst moment — a production incident where every minute matters and you have no idea where to look. Observability is paid for in advance, by instrumenting the system while it's calm, so that when it's on fire you have light to see by. The time to add it is before you need it.
You can't pause production to inspect it. Observability is building the system to explain itself — so you can answer questions you never foresaw, not just watch for ones you did.
The oldest and most detailed signal is the log: a running diary of what the system did. It's the first thing you reach for and the easiest to misuse.
A log is the system's diary
A ship's logbook: timestamped entries noting what happened and when — "10:42, entered the harbour" — so anyone can reconstruct the voyage afterward.
A log is a timestamped record of an event: "user 42 logged in," "payment failed: card declined," "started processing order 91." Each line is a small, detailed note about something that happened, written as the code runs. Logs are the richest signal because they carry specifics — exactly what, exactly when, with the details attached. When you need to know what really happened in one case, logs are where you look.
Structured logs are searchable
The difference between a shoebox of handwritten notes and a spreadsheet — both hold the facts, but only one lets you instantly filter to everything about customer 42.
A line of free text is hard to search across millions of entries. Structured logs record each event as data — fields like user_id: 42, status: failed, duration_ms: 230 — so you can query them: "show every failed payment over 500ms today." This turns logs from a wall of text you scroll through into a database you can interrogate. Structure is what makes logs usable at production scale.
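A minimal sketch of the idea, assuming each event is written as one JSON object per line. The fields `user_id`, `status`, and `duration_ms` come from the example above; the `event` field, the `log_event` and `query` helpers, and the in-memory list are illustrative assumptions, not any particular logging library's API.

```python
import json

def log_event(sink, **fields):
    """Append one event to the sink as a single JSON line."""
    sink.append(json.dumps(fields, sort_keys=True))

def query(sink, predicate):
    """Parse every line back into a dict and keep those matching the predicate."""
    events = (json.loads(line) for line in sink)
    return [e for e in events if predicate(e)]

log_lines = []  # stands in for a log file or log store (assumption)
log_event(log_lines, event="payment", user_id=42, status="failed", duration_ms=730)
log_event(log_lines, event="payment", user_id=7, status="ok", duration_ms=120)
log_event(log_lines, event="payment", user_id=42, status="failed", duration_ms=230)

# "Show every failed payment over 500ms"
slow_failures = query(
    log_lines,
    lambda e: e["status"] == "failed" and e["duration_ms"] > 500,
)
print(slow_failures)
```

Because every entry is data rather than free text, the question becomes a filter on fields instead of a text search.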
Levels separate signal from noise
A diary where routine entries are in pencil and emergencies are in red ink — so in a crisis you can flip straight to the red and ignore the rest.
Every log entry has a level marking its importance: DEBUG and INFO for routine detail, WARN for something off, ERROR for a real failure. Levels let you turn the volume up when investigating and down in normal operation, and jump straight to the errors in a flood of entries. Used well, levels are how you keep the important lines findable; ignored, every entry looks equally urgent and none stands out.
Too much logging is as useless as too little
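As a rough illustration of turning the volume up and down, here is a sketch using Python's standard `logging` module (which spells WARN as `WARNING`). The logger name, the messages, and the `ListHandler` class are made up for the example.

```python
import logging

class ListHandler(logging.Handler):
    """Collects log lines in memory so the effect of the level is visible."""
    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(f"{record.levelname}: {record.getMessage()}")

logger = logging.getLogger("checkout_example")  # hypothetical logger name
logger.propagate = False                        # keep output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

def run_request():
    logger.debug("cart contents loaded")
    logger.info("order 91 started")
    logger.warning("payment provider slow, retrying")
    logger.error("payment failed: card declined")

logger.setLevel(logging.WARNING)   # normal operation: volume down
run_request()
quiet = list(handler.lines)

handler.lines.clear()
logger.setLevel(logging.DEBUG)     # investigating: volume up
run_request()
verbose = list(handler.lines)

print(quiet)
print(verbose)
```

The code that writes the log lines never changes; only the level threshold does, which is what makes levels a cheap volume control.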
A diary that records every single breath is as unreadable as a blank one — the signal you need is buried in noise nobody can sift.
Logs are tempting to overuse, and excessive logging has a real cost: it's expensive to store, slow to search, and it buries the lines that matter under noise. The skill is logging the things you'll actually want later — decisions, failures, key events — with enough detail to be useful and not so much that the signal drowns. Log on purpose, not reflexively.
Logs are the detailed diary of what happened. Make them structured so you can query, leveled so you can filter, and focused so the signal isn't lost in noise.
Logs tell you about individual events. Metrics tell you about the whole, over time — the numbers that show, at a glance, whether the system is healthy or sliding.
A metric is a number over time
A hospital chart tracking heart rate through the night — not every heartbeat, just the number sampled steadily, so a rising trend jumps out.
A metric is a measurement tracked over time: requests per second, error rate, response time, memory used. Unlike a log, it isn't about one event; it's an aggregate, sampled at regular intervals, so you see trends and spikes. Metrics are cheap to collect and store because each one is just a number per interval, however many requests it summarises, which means you can keep them on for everything, all the time, and watch the shape of your system move.
Counters go up; gauges go up and down
An odometer only ever climbs, counting total miles ever driven; a speedometer rises and falls with how fast you're going right now.
Two basic shapes cover most metrics. A counter only increases — total requests served, total errors — and you watch its rate of change. A gauge moves up and down to show a current value — memory in use, active connections, queue length. Knowing which one you're looking at tells you how to read it: a counter's slope is the story; a gauge's current height is.
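The two shapes can be sketched in a few lines. The `Counter` and `Gauge` classes below are toy stand-ins written for this example, not a real metrics library, and the sample values are invented.

```python
class Counter:
    """A value that only ever increases, like total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount

class Gauge:
    """A value that can rise and fall, like active connections."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

def rate_per_second(earlier, later, seconds):
    """Read a counter by its slope: the change between two samples over time."""
    return (later - earlier) / seconds

requests_total = Counter()
active_connections = Gauge()

requests_total.inc(1200)
sample_a = requests_total.value          # sampled at t = 0s
requests_total.inc(300)
sample_b = requests_total.value          # sampled at t = 60s

active_connections.set(35)
active_connections.set(12)               # a gauge is free to drop

print(rate_per_second(sample_a, sample_b, 60))  # requests per second over the minute
print(active_connections.value)                 # the current height is the reading
```

The counter's raw total (1500) says little on its own; the rate between two samples is the useful reading, while the gauge is read directly.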
Dashboards make trends visible
A wall of gauges in a control room: at a glance the operator sees everything is green and steady, or that one dial is climbing toward red.
Metrics are usually viewed on a dashboard — graphs of the key numbers over time, side by side. A good dashboard lets you see the system's health in seconds and spot the moment a line bent the wrong way. This is where metrics earn their keep: not in any single value, but in the visible trend that says "something started going wrong at 2pm" before users finish complaining.
The four golden signals
A doctor's quick vital-signs check — pulse, temperature, blood pressure, breathing — a tiny set that catches most of what's wrong without measuring everything.
You don't need a thousand metrics to start. The four golden signals cover most of a system's health: latency (how long requests take), traffic (how much demand there is), errors (how many requests fail), and saturation (how full your resources are). Watch these four and you'll catch the large majority of problems early. They're the vital signs of a service — the first dials to put on the panel.
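To make the four signals concrete, here is a sketch that derives them from one window of request records. The `Request` record, the choice of the 95th percentile as the latency summary, and "busy workers out of total workers" as the saturation measure are all assumptions for illustration; real systems pick whatever fits their resources.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: int
    ok: bool

def golden_signals(requests, window_seconds, workers_busy, workers_total):
    """Summarise one time window of traffic as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer maths to avoid float rounding.
    rank = (95 * len(durations) + 99) // 100
    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
        "saturation": workers_busy / workers_total,
    }

# Invented sample: 20 requests in a 10 second window, one of them failed.
window = [Request(100, True)] * 18 + [Request(200, True), Request(900, False)]
signals = golden_signals(window, window_seconds=10, workers_busy=6, workers_total=8)
print(signals)
```

Four numbers per window are enough to graph on a first dashboard, and each maps directly to one of the signals named above.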
Metrics are numbers over time — counters that climb, gauges that swing — shown on dashboards. Start with the four golden signals: latency, traffic, errors, saturation.
In a system of many services, one user request touches many of them. A trace follows that single request all the way through — answering the question logs and metrics can't: where, across everything, did the time or the failure happen?
A trace follows one request across services
A parcel's tracking history: picked up here, sorted there, flown across, delivered — the whole journey of one package across every hand that touched it.
In modern systems a single request hops through many services — the gateway calls the order service, which calls payments, which calls a database. A trace records that entire journey for one request, stitched together so you can see every hop it made. It answers a question neither logs nor metrics can: not "what happened" or "how much," but "what was the path of this specific request through the whole system?"
Spans show where the time went
A travel itinerary broken into legs, each with its own duration — and instantly you can see that the three-hour layover, not the flights, ate the day.
A trace is made of spans — one per step, each timing how long that piece took. Laid out as a waterfall, the spans show exactly where a slow request spent its time: the database span took 900ms while everything else took 50. Without this, "the request was slow" is a mystery across many services; with it, you point straight at the culprit. Spans turn a vague slowness into a precise location.
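A small sketch of the waterfall idea, using the numbers from the paragraph above (a 900ms database step and 50ms for everything else). The `Span` class, the service names, and the text rendering are illustrative assumptions; the spans are laid out as sequential steps for simplicity, whereas real tracing systems usually nest them as parent and child.

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

# One request's trace, flattened into sequential steps (assumption).
trace = [
    Span("gateway", 0, 10),
    Span("order service", 10, 30),
    Span("payments", 30, 50),
    Span("database query", 50, 950),
]

def waterfall(spans, ms_per_char=25):
    """Render each span as a bar offset by its start time."""
    lines = []
    for span in spans:
        offset = " " * (span.start_ms // ms_per_char)
        bar = "#" * max(1, span.duration_ms // ms_per_char)
        lines.append(f"{span.name:<15}{offset}{bar} {span.duration_ms}ms")
    return "\n".join(lines)

slowest = max(trace, key=lambda s: s.duration_ms)
print(waterfall(trace))
print("slowest step:", slowest.name)
```

Printed this way, the long bar for the database step stands out immediately against the short bars before it.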
Tracing is how you debug distributed systems
Following a single dropped baton back through a relay race to find exactly which handoff failed — impossible to see from the final result alone.
When a request fails or crawls and the work is spread across services, a trace is often the only way to find where. Metrics tell you the system is slow; logs tell you what each service said; the trace ties one request's path together and points at the exact service and step that broke. For anything distributed, tracing is the tool that turns "somewhere in there it went wrong" into "right here."
A trace follows one request across every service, broken into timed spans — so you can see exactly where, in a distributed system, the time or the failure actually happened.
Logs, metrics, and traces are called the three pillars of observability because each answers a different question. The power is in using them together, not in picking one.
Each pillar answers a different question
Investigating a problem with three tools: a chart showing when things changed, a map showing where, and a detailed report showing exactly what — each useless alone, decisive together.
The three pillars divide the work. Metrics answer "is something wrong, and what's the trend?" — the cheap, always-on overview. Traces answer "where, across all the services, is the problem?" — narrowing it to a step. Logs answer "what exactly happened there?" — the full detail at the scene. No single one is enough; each picks up where the others stop.
The natural flow: notice, locate, explain
A doctor sees a fever on the chart, examines to find which organ, then runs the specific test that names the illness — broad signal, then narrow, then exact.
In practice you move through them in order. A metric or alert tells you something's wrong (error rate jumped). A trace shows you where (the payment service step is failing). A log at that point tells you exactly what (the payment provider returned a timeout). Broad to narrow to exact: the metric notices, the trace locates, the log explains. Knowing this flow is most of knowing how to debug production.
Correlation ties them together
A case number written on every document, photo, and report — so an investigator can pull everything about one incident together instead of searching each pile separately.
The pillars are far stronger when linked. Attaching a shared request ID (or trace ID) to the trace and every log line for a request, and tying metrics back to example requests that carry that ID, lets you jump from "this request failed" straight to its trace and its exact logs. (Metrics are aggregates, so they usually point at a few sample requests rather than carrying every ID.) Without correlation you have three separate haystacks; with it, one pull brings the whole story of an incident together. The links between the pillars are what make them a system rather than three tools.
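A minimal sketch of correlation, assuming every log line and span is stamped with the same request ID. The two in-memory stores, the helper names, and the IDs `req-1` and `req-2` are hypothetical; a real setup would use separate log and trace backends.

```python
log_store = []    # stands in for a log backend (assumption)
span_store = []   # stands in for a trace backend (assumption)

def log(request_id, level, message):
    log_store.append({"request_id": request_id, "level": level, "message": message})

def record_span(request_id, name, duration_ms):
    span_store.append({"request_id": request_id, "name": name, "duration_ms": duration_ms})

def handle_checkout(request_id, provider_times_out):
    """Simulated request handler that stamps every signal with the request ID."""
    log(request_id, "INFO", "checkout started")
    record_span(request_id, "order service", 20)
    if provider_times_out:
        record_span(request_id, "payments", 5000)
        log(request_id, "ERROR", "payment provider returned a timeout")
        return False
    record_span(request_id, "payments", 40)
    log(request_id, "INFO", "checkout completed")
    return True

def story_of(request_id):
    """One pull: gather every signal that carries this request's ID."""
    return {
        "logs": [e for e in log_store if e["request_id"] == request_id],
        "spans": [s for s in span_store if s["request_id"] == request_id],
    }

handle_checkout("req-1", provider_times_out=False)
handle_checkout("req-2", provider_times_out=True)

incident = story_of("req-2")
print(incident)
```

The lookup for `req-2` returns only that request's spans and log lines, even though both stores also hold entries for the healthy request.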
Metrics notice, traces locate, logs explain — broad to narrow to exact. Correlate them with shared request IDs and three tools become one story of what went wrong.
Observability lets you investigate; alerting tells you when to start. The hard part isn't detecting problems — it's deciding which ones are worth waking a human for.
Alert on what users feel
A smoke alarm should go off for an actual fire, not every time someone makes toast — or people rip the battery out and miss the real one.
An alert is an automatic notification when something crosses a line. The art is alerting on symptoms users actually feel — checkouts failing, the site slow — not on every internal twitch. A server at 80% memory isn't a problem if users are fine; a spike in failed orders is, even if every server looks healthy. Alert on the user-facing outcome, and you page humans for things that genuinely matter.
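The contrast between the two cases above can be sketched as a paging rule. The snapshot fields and the 2% failed-order threshold are invented for illustration; the point is only that the rule looks at the user-facing outcome and deliberately ignores the internal reading.

```python
def should_page(snapshot, max_failed_order_rate=0.02):
    """Page only on the user-facing symptom; memory usage is ignored on purpose."""
    if snapshot["orders_total"] == 0:
        return False
    failed_rate = snapshot["orders_failed"] / snapshot["orders_total"]
    return failed_rate > max_failed_order_rate

# Server is at 80% memory, but users are fine.
busy_but_fine = {"memory_used_fraction": 0.80, "orders_total": 500, "orders_failed": 2}

# Every server looks healthy, but orders are failing.
healthy_looking_but_broken = {"memory_used_fraction": 0.35, "orders_total": 500, "orders_failed": 60}

print(should_page(busy_but_fine))               # False
print(should_page(healthy_looking_but_broken))  # True
```

Memory usage still belongs on a dashboard for investigation; it just does not decide whether a human gets woken up.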
SLI, SLO, SLA: defining good enough
A delivery company promising "99% of parcels arrive within two days" — a clear, measurable bar for what counts as acceptable service, agreed in advance.
Three initials formalise "how good is good enough." An SLI is the measurement (the percentage of requests served in under 300ms). An SLO is the target you set for that measurement (99.9% of requests should be that fast). An SLA is a contractual promise to a customer, with consequences if missed. Setting an explicit SLO turns reliability from a vague feeling into a number you can hold the system to, and gives you a clear line for when to act.
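The relationship between the SLI and the SLO can be shown with the numbers from the paragraph above (300ms and 99.9%). The function name and the sample request durations are made up for the example.

```python
def sli_fast_requests(durations_ms, threshold_ms=300):
    """SLI: the fraction of requests served in under the threshold."""
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

SLO_TARGET = 0.999  # the target: 99.9% of requests under 300ms

# Invented sample: 10,000 requests, 15 of them slow.
durations = [120] * 9985 + [450] * 15
sli = sli_fast_requests(durations)
meets_slo = sli >= SLO_TARGET

print(f"SLI = {sli:.4f}, SLO target = {SLO_TARGET}, met = {meets_slo}")
```

Here 15 slow requests out of 10,000 give an SLI of 99.85%, which misses the 99.9% target: the SLO allows at most 10 slow requests in a sample of that size.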
Alert fatigue is the silent killer
A car that beeps constantly for trivia trains the driver to ignore all beeps — so the one warning that mattered gets tuned out with the rest.
Too many alerts is its own failure: when people are paged for things that don't matter, they start ignoring alerts entirely, and the real one gets missed in the noise — alert fatigue. Every alert should be actionable and worth a human's attention right now; if it isn't, it belongs on a dashboard, not in a pager. Fewer, sharper alerts beat a flood, because an alert nobody trusts is worse than no alert at all.
Alert on symptoms users feel, define "good enough" with an SLO, and guard against alert fatigue — because an alert no one trusts is worse than none.
Observability is built in, not bolted on. The skill is instrumenting as you go, keeping the signal clean, and spending your attention on the few things that reveal the most.
Instrument as you build, not after the fire
Wiring the smoke detectors while you build the house — not standing in the smoke wishing you had.
The worst time to add observability is during an outage, when you discover the system tells you nothing. So you build it in as you go: emit a metric for the new feature, log its key decisions, make sure a request can be traced through it. A little instrumentation added while the code is fresh saves hours of blind guessing later. Treat "how will I see this in production?" as part of building it, not an afterthought.
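One low-effort way to instrument as you go is to wrap a new function so that every call is counted and timed. The sketch below is an assumption about how that might look, with a plain dictionary standing in for a metrics client and `apply_discount` as a hypothetical new feature.

```python
import functools
import time

metrics = {}  # metric name -> running total; stands in for a metrics client

def instrumented(name):
    """Wrap a function so every call records a call count, an error count, and total time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            metrics[f"{name}.calls"] = metrics.get(f"{name}.calls", 0) + 1
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[f"{name}.errors"] = metrics.get(f"{name}.errors", 0) + 1
                raise
            finally:
                elapsed_ms = (time.perf_counter() - started) * 1000
                metrics[f"{name}.total_ms"] = metrics.get(f"{name}.total_ms", 0.0) + elapsed_ms
        return wrapper
    return decorator

@instrumented("apply_discount")
def apply_discount(price, percent):
    """Hypothetical new feature being shipped with its instrumentation."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100

apply_discount(200, 25)
try:
    apply_discount(200, 150)   # invalid input: counted as an error
except ValueError:
    pass

print(metrics)
```

The feature code itself stays unchanged; the decorator is added once, while the code is fresh, and the counts are there when the first incident arrives.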
Start with the golden signals and grow
A new shop owner first watches the few numbers that matter — sales, footfall, complaints — and adds finer tracking only as specific questions arise.
You don't need total coverage on day one. Start with the four golden signals and a handful of logs at key decision points, then add detail where real incidents show you a blind spot. Every outage is a lesson in what you wish you'd been recording — so let your observability grow along the contours of where the system actually surprises you, rather than instrumenting everything up front.
- What can it tell me in production, or is it a black box when it breaks?
- The golden signals: am I tracking latency, traffic, errors, and saturation?
- Logs: structured, leveled, and focused on the decisions and failures that matter?
- Tracing: can I follow one request across the services it touches?
- Correlation: do metrics, logs, and traces share a request ID?
- Alerts and an SLO: do I page on what users feel, against a defined target?

- observability / monitoring: answering unforeseen questions, versus watching known ones.
- log / structured / level: the detailed diary, made queryable and filterable.
- metric / counter / gauge / dashboard: numbers over time, the two shapes, and how you view them.
- trace / span: one request's path across services, broken into timed steps.
- the four golden signals: latency, traffic, errors, saturation.
- alert / SLI / SLO / SLA: the notification, the measure, the target, the promise.
- alert fatigue / correlation (request ID): the noise trap, and the thread that ties it all together.

- When something breaks, the system tells you where to look instead of leaving you guessing.
- You move metric → trace → log to go from noticing to locating to explaining.
- A shared request ID ties one incident's signals together.
- Your alerts fire on user-facing symptoms and people still trust them.
- You have explicit SLOs, and you instrumented as you built, not during the outage.
Good observability is built in: golden signals from the start, structured logs, traceable requests, correlated by ID, and a few trustworthy alerts against a real SLO.