Software Architect · Module 15
A reliable system doesn't promise that nothing will break. It limits the damage, recovers quickly, and is honest about its own state.
SLO · error budget · graceful degradation · recovery
Reliability is a managed quality of service with explicit targets — not hope that the code turns out fine.
SLOs translate reliability into product language
"The trains should run well" is impossible to verify. "99% of trains arrive within five minutes of schedule" is something you can measure.
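An SLA is a promise to the customer. An SLO is the internal target. An SLI is the measured indicator: availability, latency, error rate, freshness. The error budget is the complement of the SLO: the amount of failure the target tolerates within the measurement period. A 99.9% availability SLO over 30 days, for example, leaves about 43 minutes of downtime.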
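The relationship between SLI, SLO, and error budget can be sketched in a few lines. This is a minimal illustration, not a standard API: the names `Slo`, `availability_sli`, `error_budget`, and `budget_remaining` are hypothetical, and it assumes an event-based SLI (good events divided by total events).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Internal target, e.g. 0.999 means 99.9% of events must be good."""
    target: float  # fraction in (0, 1)


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that met the success criterion."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return good_events / total_events


def error_budget(slo: Slo, total_events: int) -> float:
    """Number of bad events the SLO tolerates for this volume of traffic."""
    return (1.0 - slo.target) * total_events


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = error_budget(slo, total_events)
    if allowed == 0:
        return 0.0
    bad = total_events - good_events
    return 1.0 - bad / allowed
```

With a 99.9% target and 1,000,000 requests, the budget is about 1,000 failed requests; 600 failures leave roughly 40% of it unspent.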
The architect uses SLOs to make trade-offs: if the error budget is burning fast, the team dials down release risk; if there's budget to spare, the team can experiment faster.
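One way to turn that trade-off into an explicit rule is to compare how much budget has been spent with how much of the period has elapsed. The sketch below is illustrative only: the function names, the three policy labels, and the thresholds are assumptions, and a real team would agree on its own.

```python
def burn_rate(budget_spent: float, window_elapsed: float) -> float:
    """Budget consumption relative to an even spend over the period.

    Both arguments are fractions: 0.5 means "half". A result of 1.0 means
    the budget would run out exactly at the end of the period.
    """
    if window_elapsed <= 0:
        raise ValueError("window_elapsed must be positive")
    return budget_spent / window_elapsed


def release_policy(budget_spent: float, window_elapsed: float) -> str:
    """Map budget state to a release stance (hypothetical labels)."""
    if budget_spent >= 1.0:
        return "freeze"        # budget exhausted: reliability work only
    if burn_rate(budget_spent, window_elapsed) > 1.0:
        return "reduce-risk"   # spending faster than the period allows
    return "normal"            # budget to spare: experiments are affordable


print(release_policy(budget_spent=0.5, window_elapsed=0.25))
```

Here half the budget is gone after a quarter of the period, so the burn rate is 2.0 and the rule asks for lower-risk releases.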
Failure should be an expected scenario
You don't draw up an evacuation plan because you want a fire. You draw it up because a fire is possible.
Reliability is engineered through timeouts, retries, bulkheads, rate limits, circuit breakers, backups, restore drills, health checks, runbooks, and incident response. Prevention matters — so do detection, mitigation, and recovery.
The system has to degrade in a controlled way: if the recommendation engine goes down, checkout still works; if analytics is lagging, the user action shouldn't hang on it.
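A circuit breaker with a fallback is one of the mechanisms listed above that makes degradation controlled. The following is a minimal single-threaded sketch, not a production implementation: the class name, its parameters, and the default thresholds are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, let trial calls through again (half-open).
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._is_open():
            return fallback()  # fail fast: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

For the recommendation example, `primary` would call the recommendation engine and `fallback` would return an empty or cached list, so checkout never waits on a dependency that is already known to be down.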
Reliability isn't proven by a slide deck — it's proven by an incident or a drill.
Example: graceful degradation of search
If the departure board at the airport goes dark, the staff still need a backup list of flights.
The search service is unavailable. The product page still loads, the catalogue shows a fallback of popular items, and the UI tells the user search is temporarily limited. An alert reaches on-call, and the runbook covers the index check and the rollback.
The user sees a feature degrade — not the whole product fall over.
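The search scenario above could look roughly like this in code. It is a sketch under stated assumptions: `SearchResponse`, `search_with_fallback`, and the `search_backend` and `alert` callables are hypothetical names, and the backend client is assumed to enforce its own timeout and raise an exception on failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchResponse:
    items: List[str]
    degraded: bool      # lets the UI show the "search is limited" state
    notice: str = ""


def search_with_fallback(
    query: str,
    search_backend: Callable[[str], List[str]],  # hypothetical search client
    popular_items: List[str],                    # precomputed fallback list
    alert: Callable[[str], None],                # hypothetical on-call hook
) -> SearchResponse:
    """Return search results, or popular items if the backend fails."""
    try:
        items = search_backend(query)
    except Exception as exc:
        # Tell on-call, then degrade the feature instead of failing the page.
        alert(f"search backend failed: {exc!r}")
        return SearchResponse(
            items=list(popular_items),
            degraded=True,
            notice="Search is temporarily limited. Showing popular items.",
        )
    return SearchResponse(items=items, degraded=False)
```

The `degraded` flag keeps the system honest about its own state: the page renders either way, but the UI and monitoring can both tell that the results are a fallback.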
Anti-example: backup without restore
A spare key is useless if no one knows which door it opens.
The team says backups are configured. But restore was never tested, the RPO (how much data loss is acceptable) and the RTO (how long recovery may take) are unknown, only one person has access to the backups, and the recovery procedure isn't documented.
A backup is only half of reliability. The real guarantee shows up after a successful restore drill.
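A restore drill produces numbers that can be checked against the targets. The sketch below assumes the drill records three timestamps; `RestoreDrill`, `DrillResult`, and `evaluate_drill` are hypothetical names, and the RPO and RTO values in the usage are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    last_backup_at: datetime        # newest backup usable for the restore
    incident_at: datetime           # simulated moment of data loss
    service_restored_at: datetime   # service verified working again


@dataclass(frozen=True)
class DrillResult:
    data_loss_window: timedelta     # compared against the RPO
    recovery_time: timedelta        # compared against the RTO
    rpo_met: bool
    rto_met: bool


def evaluate_drill(drill: RestoreDrill, rpo: timedelta, rto: timedelta) -> DrillResult:
    """Compare what the drill actually achieved with the agreed targets."""
    data_loss_window = drill.incident_at - drill.last_backup_at
    recovery_time = drill.service_restored_at - drill.incident_at
    return DrillResult(
        data_loss_window=data_loss_window,
        recovery_time=recovery_time,
        rpo_met=data_loss_window <= rpo,
        rto_met=recovery_time <= rto,
    )


t0 = datetime(2024, 1, 1, 12, 0)
result = evaluate_drill(
    RestoreDrill(t0 - timedelta(minutes=30), t0, t0 + timedelta(hours=2)),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=1),
)
print(result.rpo_met, result.rto_met)
```

In this made-up drill the backup was fresh enough, but recovery took two hours against a one-hour target, which is exactly the kind of gap that only shows up when the restore is actually run.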
- Which SLIs actually reflect the user experience?
- What's the RPO and RTO for the critical data?
- What will keep working when a dependency fails?
- When did you last verify a restore?