THN Interview Prep

Availability Math: Serial vs Parallel, SLAs, Downtime

Definition

Availability is the fraction of time a system is operational and correct per its definition (often hand-waved as "uptime"). It is usually expressed as nines: 99.9% ("three nines"), 99.99% (four nines), etc.

Serial availability — Components in series (all must work for success): total availability multiplies (more fragile end-to-end).

Parallel availabilityRedundant paths in parallel (any one can serve): failure probability compounds for "all fail" (more resilient if independent).

Why it matters in interviews

You should translate SLA to downtime minutes per year in your head, and explain why 99.99% for five serial dependencies is almost impossible without massive redundancy and loose coupling. Interviewers test whether you budget error budgets and avoid magic "five nines" claims.

Tradeoffs

  • Higher availability costs redundancy, multi-AZ/region, testing, and operational maturity.
  • Parallel redundancy cuts correlated failures (same bad deploy, same flood) if blast radius is separated.
  • Composite SLAs — Weakest link or explicit dependency math; SLI/SLO culture uses error budgets (see Google SRE).

SLA to approximate downtime (per year)

AvailabilityApproximate max downtime / year
99% ("two nines")~3.65 days
99.9% (three nines)~8.76 hours
99.99% (four nines)~52.6 minutes
99.999% (five nines)~5.26 minutes

(Using ~365.25 days/year; exact marketing numbers vary slightly.)

Serial vs parallel intuition

If each hop has availability A, serial end-to-end is about A^n for n independent serial hops (worse as n grows).

Parallel (one of two nodes suffices, independent failure): probability both down is roughly (1-A)^2 for two, so combined availability higher than a single node when failover works.

Loading diagram…

Concrete examples

  1. API calling auth, payments, inventory in series — Each must meet its SLO or the product SLO is the tightest chain link unless you degrade (cached auth, async inventory).
  2. Multi-AZ databaseParallel storage copies with consensus; serial client path if app and DB each fail independently—end-to-end math still multiplies independent failure assumptions.
  3. Monthly deploy window — If SLO allows 43 minutes downtime/year at 99.99%, one bad 45-minute outage burns the budget—drives canary and rollback investment.

How to say it in 30 seconds

"Availability multiplies in series; redundant paths improve uptime if failover is real. I translate four nines to ~53 minutes/year and ask whether the stack can honestly hit that composed. I design for degradation when a dependency misses SLO."

Common follow-up questions

  • Are AZ failures independent? Often partially correlated (control plane, operator error); math is a model, not truth.
  • What is the difference between availability and durability? Availability is "can I read/write now?"; durability is "will data survive after I acknowledge?"—different metrics.
  • How do SLAs differ from SLOs? SLA is often contractual; SLO is internal target with error budget.

See also: System design curriculum overview

Last updated on

Spotted something unclear or wrong on this page?

On this page