Design Google Maps (Consumer Mapping & Routing)
1. Requirements
Functional
- Display interactive maps: pan, zoom, rotate; show roads, labels, POIs, traffic overlays.
- Search for places and addresses; autocomplete as user types.
- Compute turn-by-turn routes for driving, walking, transit; reroute on deviation.
- Live traffic-aware ETA; incident and closure awareness (sourced from feeds + probes).
- Offline map packs for selected regions (simplified).
Non-Functional
- Scale: billions of MAU; global read-heavy tile and static map API QPS in the millions at peak; routing hundreds of thousands QPS aggregate.
- Latency: vector tile fetch p99 < 100 ms from edge; route compute p99 < 200–500 ms depending on distance and mode.
- Availability: 99.99% for read path; routing may degrade to cached or simpler graphs during incidents.
- Consistency: eventual for tile updates; strong for user-saved places when stored in account backend.
- Durability: map data versioned; user data durable; probe streams retained per policy.
Out of Scope
- Full autonomous vehicle HD maps and LiDAR layers.
- Indoor mapping at millimeter accuracy.
- Legal licensing of third-party map data beyond high-level assumptions.
2. Back-of-Envelope Estimations
Assume 1B MAU, 10 map sessions/user/day average (varies).
- Tile requests: vector tiles, ~10–30 per session during pan/zoom → 100B–300B tile requests/day upper bound; most are served from the CDN. Effective origin QPS after the CDN: ~50k–200k aggregate (CDN hit ratio > 90%).
- Routing: 5% of sessions request a route → 500M routes/day → ~6k QPS average, 50k+ at peak, skewed to commute hours.
- Storage: global base map at petabyte scale in object storage; incremental updates in the TB/day range. User saved lists are far smaller (sharded SQL).
- Bandwidth: vector protobuf tiles compress well; PB/month of egress, mostly served from the CDN.
- Cache: popular viewport tiles and static route-overview polylines—80/20 metro coverage suggests regional edge caches of tens of TB aggregate.
These estimates tie directly into the scaling choices that follow; a worked version appears below.
Index build churn: If ~1% of graph edges change daily from construction feeds, that is ~500M edge updates/day under web-scale mapper assumptions (a graph on the order of 50B edges). Batch graph deltas nightly: streaming updates patch subgraphs while global precomputations refresh offline.
Directions compute budget: A naive search over a continental graph touches millions of edges and cannot meet the latency SLO; precomputed shortcuts shrink the explored set to 10³–10⁴ vertices even for long drives. Always sanity-check the average explored fringe in telemetry before buying more CPU.
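A quick sanity check of these numbers as a worked calculation. This is a sketch whose inputs are the assumptions stated above; the exact CDN hit ratio and peak multiplier are assumed values, not figures from the design.

```python
SECONDS_PER_DAY = 86_400

# Inputs: the assumptions stated above.
mau = 1_000_000_000
sessions_per_user = 10           # per day
tiles_per_session = 30           # upper bound of the 10-30 range
route_fraction = 0.05
cdn_hit_ratio = 0.95             # assumed; the text only says "> 90%"
peak_multiplier = 8              # assumed commute-hour skew

sessions_per_day = mau * sessions_per_user                    # 10B
tile_requests_per_day = sessions_per_day * tiles_per_session  # 300B upper bound
origin_qps = tile_requests_per_day * (1 - cdn_hit_ratio) / SECONDS_PER_DAY

routes_per_day = sessions_per_day * route_fraction            # 500M
route_qps_avg = routes_per_day / SECONDS_PER_DAY              # ~5.8k
route_qps_peak = route_qps_avg * peak_multiplier              # ~46k

print(f"origin tile QPS (upper bound): {origin_qps:,.0f}")    # ~174k
print(f"route QPS avg/peak: {route_qps_avg:,.0f} / {route_qps_peak:,.0f}")
```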
3. API Design
GET /v1/tiles/{z}/{x}/{y}?style=roads&v={dataVersion}
-> 200 application/x-protobuf-tile
GET /v1/places/autocomplete?input=...&sessionToken=...
-> 200 { predictions: [{ placeId, mainText, secondaryText }] }
POST /v1/directions
Body: { origin, destination, mode: "driving"|"walking"|"transit", departAt?, avoidTolls? }
-> 200 { legs: [...], polyline, durationSeconds, distanceMeters }
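A minimal client sketch of the directions call. The host, timeout, and error handling are placeholders; the request and response shapes are the ones defined above.

```python
import requests

def get_route(origin: str, destination: str, mode: str = "driving") -> dict:
    """Request a route from the directions endpoint (hypothetical host)."""
    resp = requests.post(
        "https://maps.example.com/v1/directions",  # placeholder host
        json={"origin": origin, "destination": destination, "mode": mode},
        timeout=2.0,  # route compute SLO is p99 < 200-500 ms
    )
    resp.raise_for_status()
    return resp.json()  # { legs, polyline, durationSeconds, distanceMeters }

route = get_route("47.6062,-122.3321", "45.5152,-122.6784")
print(route["durationSeconds"], route["distanceMeters"])
```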
GET /v1/traffic/segments?bbox=...
-> 200 { segments: [{ edgeId, speedKph, confidence }] }
Errors: 400 bad bbox, 404 unknown tile version (client refresh), 503 routing overload.
GET /v1/directions/matrix
Body: { origins: [...], destinations: [...], mode }
-> 200 { durations: [[seconds]] }
4. Data Model
- Road graph: directed graph stored in partitioned tiles; edge attributes (speed, lanes, turn restrictions). Often columnar + custom in-memory representation for routing engines.
- Tile manifest: `z/x/y` → object storage key + `dataVersion` for cache invalidation.
- Place: `placeId`, names, geometry, categories, hours (served from search index + KV).
- UserSavedPlace (account scope): in Postgres with `userId`, `placeId`, `label`.
SQL for user data; inverted-index search (Elasticsearch/OpenSearch) for autocomplete; object store (S3/GCS) for tiles. The graph may live in a custom binary format, replicated per region.
Indexes: search n-grams on place names; spatial index for “near me”.
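A sketch of the `z/x/y` → object-key mapping behind the tile manifest, using the standard Web Mercator (slippy-map) tile math; the key layout and version string are assumptions.

```python
import math

def latlng_to_tile(lat: float, lng: float, z: int) -> tuple[int, int]:
    """Standard Web Mercator (slippy map) tile coordinates at zoom z."""
    n = 2 ** z
    x = int((lng + 180.0) / 360.0 * n) % n  # modulo handles anti-meridian wrap
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_object_key(z: int, x: int, y: int, data_version: str) -> str:
    """Immutable versioned keys enable long CDN TTLs (layout is an assumption)."""
    return f"tiles/{data_version}/{z}/{x}/{y}.pbf"

x, y = latlng_to_tile(35.6762, 139.6503, 12)  # Tokyo at z12
print(tile_object_key(12, x, y, "2024-06-01"))
```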
5. High-Level Architecture
Clients load tiles from a CDN backed by object storage. The route service loads graph shards into memory with a cache; the traffic service merges probe-stream aggregates into Redis or in-memory edge speeds. An API gateway shapes traffic; edge TLS termination and a WAF sit in front of the load-balancing pools.
6. Component Deep-Dives
- Tile pipeline: Cartographic builds produce versioned tiles; immutable versions enable long CDN TTLs; clients pass `v=` to bust caches on data rollouts.
- Globe ↔ plane tiles: Web Mercator introduces polar distortion—routing stays on the graph, but tile pyramids must handle anti-meridian wrapping (±180°) so clients stitch cleanly across the dateline; tests often miss this until Pacific-island QA surfaces the bugs.
- Routing: Multi-level graph (highway hierarchy) with A*, Contraction Hierarchies, or SHARC-class algorithms for speed; precompute shortcuts per region; partition at administrative boundaries to parallelize.
- Traffic: Floating-car data + government feeds → per-edge speeds; smooth temporally; confidence scores down-weight sparsely observed edges (see the smoothing sketch after this list).
- Search/autocomplete: Session tokens reduce billing/charge units; prefix trie + popularity ranking; personal-history overlay.
- Failure handling: If live traffic goes stale, fall back to historical speed profiles; if a route shard is unavailable, return a longer alternate or queue the request.
- Personalization (optional): Learning home/work improves the default map viewport; an on-device cache of recent O-D pairs reduces redundant route calls—the server must stay stateless beyond signed session cookies.
- Accessibility & i18n: Voice search adds an ASR hop with its own p99 budget; RTL layouts affect tile label placement but not the vector protobuf contract.
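A minimal sketch of the traffic merge described in the list above: exponentially smoothed per-edge speeds, with a confidence weight that lets sparsely observed edges lean on historical profiles. The decay constants and the confidence increment are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EdgeSpeed:
    speed_kph: float   # smoothed live estimate
    confidence: float  # 0..1, grows with observation count

def update_edge(state: EdgeSpeed, probe_speed_kph: float,
                alpha: float = 0.3) -> EdgeSpeed:
    """Temporal smoothing of probe speeds; confidence rises per observation."""
    smoothed = (1 - alpha) * state.speed_kph + alpha * probe_speed_kph
    confidence = min(1.0, state.confidence + 0.1)  # assumed increment
    return EdgeSpeed(smoothed, confidence)

def effective_speed(state: EdgeSpeed, historical_kph: float) -> float:
    """Blend live and historical speeds; sparse edges fall back to history."""
    return state.confidence * state.speed_kph + (1 - state.confidence) * historical_kph
```

The same blend doubles as the failure fallback: when live data goes fully stale, confidence decays toward zero and the effective speed converges on the historical profile.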
7. Bottlenecks & Mitigations
- Hot tiles: World cities—over-replicate at the CDN; rate-limit abusive tile scrapers.
- Routing hotspots: Major metros—dedicated graph replicas; request coalescing for identical OD pairs during incidents (see the sketch after this list).
- Search fan-out: Autocomplete QPS—edge-cache top prefixes; Bloom-filter negative caches skip lookups for prefixes with no results.
- Probe ingest storm: Kafka backpressure; sample low-value probes; aggregate in stream windows.
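A sketch of the OD request coalescing mentioned above ("single-flight"): concurrent identical origin-destination requests share one computation. This asyncio version is a simplification; a production router would also bound the map and handle task failure.

```python
import asyncio

_inflight: dict[tuple, asyncio.Task] = {}

async def compute_route(od: tuple) -> dict:
    await asyncio.sleep(0.2)  # stand-in for the real route computation
    return {"od": od, "polyline": "..."}

async def coalesced_route(od: tuple) -> dict:
    """Identical in-flight OD pairs share a single computation."""
    task = _inflight.get(od)
    if task is None:
        task = asyncio.ensure_future(compute_route(od))
        _inflight[od] = task
        task.add_done_callback(lambda _: _inflight.pop(od, None))
    return await task

async def main() -> None:
    # Two identical OD requests in flight -> one computation, one result object.
    a, b = await asyncio.gather(
        coalesced_route(("SEA", "PDX")),
        coalesced_route(("SEA", "PDX")),
    )
    assert a is b

asyncio.run(main())
```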
8. Tradeoffs
| Decision | Alternative | Why we picked it |
|---|---|---|
| Vector tiles | Raster only | Bandwidth and zoom smoothness |
| Precomputed graph augmentations | Pure Dijkstra | Latency at continental scale |
| Partner + crowd traffic | Only static speeds | Real ETA quality |
| Regional graph partitions | Single global graph in RAM | Feasible memory and update isolation |
9. Follow-ups (interviewer drill-downs)
- 100× traffic? More regional route pools; approximations for long distances; batch similar requests.
- Exactly-once for probes? Use idempotent device batch IDs (see the dedupe sketch after this list).
- Model migration? Versioned tile format; dual-read polylines; force a client minimum-version cut.
- Multi-region active-active? Tile reads stay local; routing is authoritative per trip, session-sticky to a regional graph replica with cross-border handoff rules.
- Cost? Aggressive CDN caching; compress protobuf; deprioritize rarely viewed tile corners.
- Accessibility of routing? Stairs and wheelchair-access flags on graph edges require per-mode cost functions in the router (not one Dijkstra run); precompute overlays per mode to keep p99 stable.
- Data corrections? User-submitted edits go through a review queue; roll the tile version forward only after validation to avoid map flapping in CDN caches worldwide.
- Battery-aware routing? Cycling mode should penalize grade and intersections, not just duration—expose per-profile cost functions in the router service so product can iterate without rewriting graph tiles.
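A sketch of the idempotent probe ingestion from the exactly-once follow-up: the device attaches a batch ID, and the ingester drops batches it has already seen. Atomic set-if-absent with a TTL (Redis `SET NX EX` here) is one common implementation; the key layout and downstream step are assumptions.

```python
import redis  # assumes a reachable Redis; any KV with atomic set-if-absent works

r = redis.Redis()

def enqueue_for_aggregation(probes: list[dict]) -> None:
    """Stub for the downstream stream-aggregation step (hypothetical)."""

def ingest_probe_batch(device_id: str, batch_id: str, probes: list[dict]) -> bool:
    """Process each (device, batch) at most once; duplicate deliveries are dropped."""
    key = f"probe-batch:{device_id}:{batch_id}"
    # SET NX EX: succeeds only for a new key; the TTL bounds the dedupe window.
    if not r.set(key, 1, nx=True, ex=24 * 3600):
        return False  # duplicate delivery, already processed
    enqueue_for_aggregation(probes)
    return True
```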