Time-Series Storage (Metrics)
What it is
Time-series databases and pipelines store (timestamp, value, labels) samples—metrics from hosts, apps, and custom events. They optimize for append-heavy writes, time-range queries, and aggregation over windows.
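For concreteness, a toy sketch of that model (a hypothetical `TinyTSDB`, not any real library): one ordered list of samples per unique label set, with binary search for time-range reads.

```python
import bisect
from collections import defaultdict

# Hypothetical in-memory store illustrating the (timestamp, value, labels)
# model; real TSDBs add compression, a WAL, and inverted label indexes.
class TinyTSDB:
    def __init__(self):
        # One sorted list of (timestamp, value) per unique label set ("series").
        self.series = defaultdict(list)

    def append(self, name, labels, ts, value):
        key = (name, tuple(sorted(labels.items())))
        self.series[key].append((ts, value))  # assumes mostly in-order ingest

    def range_query(self, name, labels, start, end):
        key = (name, tuple(sorted(labels.items())))
        points = self.series[key]
        lo = bisect.bisect_left(points, (start, float("-inf")))
        hi = bisect.bisect_right(points, (end, float("inf")))
        return points[lo:hi]

db = TinyTSDB()
for t in range(0, 60, 15):                      # one sample every 15s
    db.append("cpu_usage", {"host": "web-1"}, t, 0.5 + t / 1000)
window = db.range_query("cpu_usage", {"host": "web-1"}, 0, 30)
print(sum(v for _, v in window) / len(window))  # mean over the window
```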
Metrics retention
- High-resolution raw data kept for short periods (hours to days)—bounded disk cost.
- Lower-resolution rollups kept longer (months to years) for trends and SLO reporting.
- Cardinality explosion (unbounded label values) is a primary ops risk—control label sets.
- now-15m: raw 15s resolution
- now-30d: 5m rollups
- now-1y: 1h rollups
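The cardinality risk above is multiplicative: the series count is the product of distinct values per label, so one high-churn label multiplies everything. A back-of-envelope sketch with made-up numbers:

```python
# Series count is the product of distinct values per label.
# Adding one high-churn label (e.g. user_id) multiplies everything.
labels = {
    "host": 500,         # assumed fleet size
    "endpoint": 50,
    "status_code": 10,
}
series = 1
for name, distinct in labels.items():
    series *= distinct
print(series)                 # 250,000 series: manageable
print(series * 1_000_000)     # add a user_id label: 250 billion, game over
```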
Downsampling
Downsampling aggregates older buckets (mean, max, min; percentiles with care, since averages of percentiles are misleading). It reduces storage and speeds long-range queries at the cost of losing spike detail in old data.
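A minimal downsampling sketch (illustrative, not any specific TSDB's rollup logic): collapse raw samples into fixed-width buckets, keeping max alongside mean so spikes stay visible after rollup.

```python
# Collapse raw (ts, value) samples into fixed-width buckets, keeping
# mean AND max: a mean-only rollup would hide the spike at ts=30.
def downsample(samples, bucket_seconds):
    buckets = {}
    for ts, value in samples:
        b = ts - ts % bucket_seconds          # bucket start time
        buckets.setdefault(b, []).append(value)
    return [
        (b, sum(vs) / len(vs), max(vs))       # (bucket_ts, mean, max)
        for b, vs in sorted(buckets.items())
    ]

raw = [(0, 1.0), (15, 1.2), (30, 9.0), (45, 1.1), (300, 1.0)]
print(downsample(raw, 300))   # [(0, 3.075, 9.0), (300, 1.0, 1.0)]
```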
Common stack pieces: Prometheus (pull, local TSDB), InfluxDB, TimescaleDB, Datadog-style SaaS, OpenTelemetry for ingestion.
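For concreteness, a single sample in Prometheus's text exposition format, rendered by a simplified sketch (no HELP/TYPE lines or label-value escaping):

```python
# Render one sample in the Prometheus text exposition format (simplified).
def expose(name, labels, value):
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

print(expose("http_requests_total", {"host": "web-1", "code": "200"}, 1027))
# http_requests_total{code="200",host="web-1"} 1027
```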
When to use
- Observability: dashboards, alerts, capacity planning.
- IoT device telemetry with heavy ingest (may overlap with streaming—see message-queue-vs-stream).
Alternatives
- A general-purpose SQL database for small metric volumes: simpler, but worse at ingest scale and compression.
- Logging systems (ELK) for events: not optimized for numeric series the way a TSDB is.
Failure modes
- Cardinality and label churn blow up memory and index size.
- Clock skew across emitters distorts ordering; use bounded skew handling (sketched after this list) or trust ingestion time.
- Alert fatigue from noisy metrics without good rollups and SLO windows.
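One shape bounded skew handling can take (an assumption, not any specific product's behavior): accept an emitter's timestamp only if it falls within a tolerance window of ingestion time.

```python
import time

# Bounded skew handling sketch: trust the emitter's timestamp only within
# a tolerance window around ingestion time, else fall back to ingest time.
MAX_SKEW_SECONDS = 30.0

def effective_timestamp(emitter_ts, ingest_ts=None):
    ingest_ts = ingest_ts if ingest_ts is not None else time.time()
    if abs(emitter_ts - ingest_ts) <= MAX_SKEW_SECONDS:
        return emitter_ts          # emitter clock looks sane: trust it
    return ingest_ts               # too skewed: trust ingestion time

now = 1_700_000_000.0
print(effective_timestamp(now - 5, now))     # within bound: kept
print(effective_timestamp(now - 3600, now))  # an hour off: replaced
```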
Interview talking points
- Explicit retention and downsampling policy; tie alerts to SLIs and error budgets.
- Ingest path: samples/sec, label cardinality, replication; do the back-of-envelope math (see the sketch after this list).
- Read-path latency vs. throughput: query fan-out for global dashboards.
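A worked back-of-envelope for the ingest path; every input below is a made-up, illustrative number:

```python
# Back-of-envelope ingest sizing (all inputs are illustrative assumptions).
hosts = 1_000
series_per_host = 200            # metrics x label combinations
scrape_interval_s = 15

samples_per_sec = hosts * series_per_host / scrape_interval_s
bytes_per_sample = 2             # typical with TSDB compression; raw is ~16
replication = 3

daily_bytes = samples_per_sec * 86_400 * bytes_per_sample * replication
print(f"{samples_per_sec:,.0f} samples/sec")               # ~13,333
print(f"{daily_bytes / 1e9:.1f} GB/day with replication")  # ~6.9 GB/day
```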