Design Uber (Ride-Hailing Platform)
1. Requirements
Functional
- Riders can request rides by pickup and drop-off locations; see ETA, fare estimate, and driver assignment.
- Drivers go online/offline, accept or decline offers, navigate to pickup and drop-off, complete trips.
- Live trip tracking for rider and support; surge pricing when demand exceeds supply.
- Payments after trip completion; ratings and dispute hooks (payment detail out of scope here).
- Support for multiple vehicle categories and regional rules.
Non-Functional
- Scale: tens of millions of DAU globally; peak write QPS for location updates in the hundreds of thousands; the matching path handles low tens of thousands of QPS per metro.
- Latency: driver discovery and match offer under ~2–5 s p99 in dense cities; ETA read p99 under ~500 ms when cached.
- Availability: core dispatch and safety-critical paths target 99.99%; maps and estimates may degrade gracefully.
- Consistency: strong consistency for trip state machine and payments handoff; eventual for ETA caches and driver lists near boundaries.
- Durability: trip records and audit logs durable for years; location streams retained per policy.
Out of Scope
- Full in-house maps and routing (assume integration with third-party mapping).
- Tax, regulatory filings, and insurance product design.
- In-car hardware and telematics beyond mobile SDK assumptions.
2. Back-of-Envelope Estimations
Assume 50M DAU, 5 rides per active user per month on average (mixed worldwide).
- Trips per day: 50M × 5 / 30 ≈ 8.3M trips/day.
- Peak factor ~3× daily average for metros: 8.3M trips/day ≈ ~96 QPS of new trip requests on average, ~290 QPS at global peak (per-region shards see far less). Location pings dominate: if 2M drivers update every 4 s, that is ≈ 500k location writes/s globally (sharded by region).
- Storage: ~10 KB metadata per trip + ~1 KB audit/events → ~100 GB/day of raw trip data; 5 years ≈ 180 TB before compression and tiering; location history is usually sampled or kept on a short TTL.
- Bandwidth: location streams are mostly internal; clients fetch map tiles from the CDN. Assume ~200 MB/day of compressed location deltas per 1M active drivers, internal-only at edge aggregation.
- Cache: hot ETAs and surge multipliers; an 80/20 skew on popular corridors suggests ~20% of city cells drive ~80% of traffic, so size the ETA cache to cover that working set per region (often single-digit GB of Redis per metro).
For reasoning discipline on orders of magnitude, align with back-of-envelope notes.
Safety & fraud (rough capacity): ~1% of sessions may trigger risk checks—budget ~1–3k QPS peak for synchronous scoring calls to a feature service backed by offline models; async enrichment for lower-risk cohorts keeps the match path fast.
Driver churn: If the average driver session is 4 hours online and there are 500k concurrent drivers at global peak, expect ~125k session starts per hour (and a similar number of ends) from lifecycle alone, not counting forced logouts. This informs auth and presence cluster sizing separately from trip matching.
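As a quick sanity check on these figures, here is a minimal sketch in Python using only the assumptions stated above (50M DAU, 5 rides/user/month, 3× peak factor, 2M drivers pinging every 4 s, ~11 KB per trip):

```python
# Back-of-envelope sanity check; all inputs come from the assumptions above.
DAU = 50_000_000             # daily active users
RIDES_PER_USER_MONTH = 5     # average rides per active user per month
PEAK_FACTOR = 3              # peak-to-average multiplier for metros

trips_per_day = DAU * RIDES_PER_USER_MONTH / 30            # ≈ 8.3M trips/day
avg_trip_qps = trips_per_day / 86_400                       # ≈ 96 QPS average
peak_trip_qps = avg_trip_qps * PEAK_FACTOR                  # ≈ 290 QPS peak

ONLINE_DRIVERS = 2_000_000
PING_INTERVAL_S = 4
location_writes_per_s = ONLINE_DRIVERS / PING_INTERVAL_S    # 500k writes/s

BYTES_PER_TRIP = 11 * 1024                                  # ~10 KB metadata + ~1 KB audit
storage_gb_per_day = trips_per_day * BYTES_PER_TRIP / 1e9   # ≈ 90–100 GB/day
storage_tb_5y = storage_gb_per_day * 365 * 5 / 1000         # ≈ 170–180 TB

print(f"{trips_per_day/1e6:.1f}M trips/day, {avg_trip_qps:.0f} QPS avg, {peak_trip_qps:.0f} QPS peak")
print(f"{location_writes_per_s:,.0f} location writes/s")
print(f"{storage_gb_per_day:.0f} GB/day trip data, {storage_tb_5y:.0f} TB over 5 years")
```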
3. API Design
POST /v1/trips
Body: { riderId, pickup: { lat, lng }, dropoff: { lat, lng }, productType }
-> 202 { tripId, status: "matching", estimatedFareRange }
PATCH /v1/trips/{tripId}/drivers/{driverId}/accept
-> 200 { tripId, status: "driver_assigned" }
POST /v1/drivers/{driverId}/location
Body: { lat, lng, heading, timestamp, accuracyMeters }
-> 204
GET /v1/trips/{tripId}
-> 200 { tripId, status, driverId?, route?, fare? }
GET /v1/eta?originLat=&originLng=&destLat=&destLng=
-> 200 { durationSeconds, distanceMeters, surgeMultiplier }
Errors: 409 state conflict, 429 rate limited (see rate limiter), 503 region overloaded.
DELETE /v1/trips/{tripId}
Body: { reasonCode }
-> 204
GET /v1/drivers/nearby?lat=&lng=&radiusMeters=&limit=
-> 200 { drivers: [{ driverId, etaSeconds, rating }] }
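Illustrative only: a hypothetical rider-side flow against the endpoints above, assuming a JSON-over-HTTPS gateway at an invented base URL and bearer-token auth; field names mirror the request/response sketches.

```python
import requests  # the base URL, token, and coordinates below are placeholders

BASE = "https://api.example.com/v1"
HEADERS = {"Authorization": "Bearer <rider-token>"}

# 1. Request a trip; the gateway answers 202 and matching proceeds asynchronously.
resp = requests.post(f"{BASE}/trips", headers=HEADERS, json={
    "riderId": "rider_42",
    "pickup": {"lat": 37.7749, "lng": -122.4194},
    "dropoff": {"lat": 37.7890, "lng": -122.4010},
    "productType": "standard",
})
trip = resp.json()  # { tripId, status: "matching", estimatedFareRange }

# 2. Poll trip status until a driver is assigned (a real client would use push instead).
status = requests.get(f"{BASE}/trips/{trip['tripId']}", headers=HEADERS).json()
if status["status"] == "driver_assigned":
    print("assigned driver:", status["driverId"])
```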
4. Data Model
- Trip: tripId (ULID), riderId, driverId?, status (enum), pickup, dropoff, fareSnapshot, createdAt, updatedAt.
- DriverSession: driverId, regionId, online, vehicleId, lastLocation, lastSeen.
- SurgeCell: cellId (e.g. geohash prefix), multiplier, validUntil.
SQL (Postgres) for trip money and lifecycle (ACID). Redis for driver presence buckets and ETA cache. DynamoDB or Cassandra optional for high-volume location writes per region if Postgres cannot scale writes—tradeoffs per CAP/PACELC.
Indexes: Trip(riderId, createdAt DESC), Trip(driverId, createdAt DESC), DriverSession(regionId, online).
Sample trip row: (tripId, rider_42, driver_7, en_route, …).
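A minimal sketch of the Trip table and an optimistic-locking transition in Postgres, using psycopg; column names mirror the model above, while the enum values and the version column are assumptions for illustration.

```python
import psycopg  # psycopg 3; connection handling is omitted

DDL = """
CREATE TABLE IF NOT EXISTS trip (
    trip_id       TEXT PRIMARY KEY,        -- ULID
    rider_id      TEXT NOT NULL,
    driver_id     TEXT,
    status        TEXT NOT NULL,           -- e.g. matching | driver_assigned | en_route | completed | canceled
    pickup        JSONB NOT NULL,
    dropoff       JSONB NOT NULL,
    fare_snapshot JSONB,
    version       INT  NOT NULL DEFAULT 0, -- optimistic concurrency token (assumed field)
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS trip_rider_idx  ON trip (rider_id,  created_at DESC);
CREATE INDEX IF NOT EXISTS trip_driver_idx ON trip (driver_id, created_at DESC);
"""

def assign_driver(conn: psycopg.Connection, trip_id: str, driver_id: str, expected_version: int) -> bool:
    """Transition matching -> driver_assigned only if no concurrent writer won the race."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE trip
               SET driver_id = %s, status = 'driver_assigned',
                   version = version + 1, updated_at = now()
             WHERE trip_id = %s AND status = 'matching' AND version = %s
            """,
            (driver_id, trip_id, expected_version),
        )
        return cur.rowcount == 1  # False: another transition beat us; re-read and retry or reject
```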
5. High-Level Architecture
Clients hit CDN for static and config; API gateway terminates TLS and applies auth. Trip service owns the trip state machine. Location ingress batches to Kafka; consumers update Redis geospatial indexes. ETA service combines cached segments with partner map APIs. See load balancing and message queues.
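One way the location path might look end to end; a sketch assuming kafka-python and redis-py 4+, with the topic, key names, and per-region key invented for illustration.

```python
import json
import redis
from kafka import KafkaConsumer  # kafka-python; broker and topic names are placeholders

r = redis.Redis(decode_responses=True)
consumer = KafkaConsumer("driver-locations", bootstrap_servers="localhost:9092")

for msg in consumer:
    ping = json.loads(msg.value)  # { driverId, lat, lng, heading, timestamp, accuracyMeters }

    # Drop stale or duplicate pings: keep only the newest timestamp per driver.
    ts_key = f"loc:ts:{ping['driverId']}"
    last = r.get(ts_key)
    if last is not None and float(last) >= ping["timestamp"]:
        continue
    r.set(ts_key, ping["timestamp"], ex=300)  # short TTL doubles as a liveness signal

    # Update the per-region geospatial index the matching workers query.
    r.geoadd("drivers:geo:sf", (ping["lng"], ping["lat"], ping["driverId"]))  # redis-py 4+ flat triple
```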
6. Component Deep-Dives
- Matching worker: Pull candidate drivers from Redis GEO within a radius; rank by distance, rating, and acceptance history. Use incremental reranking on location updates. Algorithm: bounded-radius nearest-neighbor with scoring, not pure brute force over all drivers (a sketch follows this list).
- Surge: Aggregate demand/supply per geohash cell (see geohash / quadtree); update multipliers on intervals to avoid thrash.
- Trip state machine: Single writer per tripId; optimistic locking or row locks in Postgres for transitions.
- Location pipeline: At-least-once delivery from mobile; the server dedupes by (driverId, sequence) and drops stale updates by timestamp.
- Failure modes: Kafka lag → fall back to a wider match radius with a higher latency cap; Redis outage → degrade to DB-backed coarse regions (worse UX).
- Observability: OpenTelemetry traces across match → offer → accept with baggage for tripId; SLO dashboards on p99 offer latency and empty-pool rate. Chaos tests for Redis partitions validate graceful degradation without 5xx storms.
- Regulatory data: Location and trip history are subject to retention and deletion requests; keep cold storage for completed trips with PII minimization on analytics exports.
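The bounded-radius candidate search above might be sketched like this, assuming redis-py's GEOSEARCH support and the per-region key from the location pipeline; the scoring weights are purely illustrative.

```python
import redis

r = redis.Redis(decode_responses=True)

def find_candidates(lat: float, lng: float, radius_m: int = 3000, limit: int = 20):
    """Bounded-radius nearest-neighbor lookup against the per-region GEO index."""
    return r.geosearch(
        "drivers:geo:sf",                      # hypothetical per-region key
        longitude=lng, latitude=lat,
        radius=radius_m, unit="m",
        withdist=True, count=limit, sort="ASC",
    )                                          # [[driverId, distance_m], ...]

def score(distance_m: float, rating: float, acceptance_rate: float) -> float:
    # Illustrative weights only; real systems tune these per market.
    return 0.6 * (1 - min(distance_m / 3000, 1)) + 0.25 * (rating / 5) + 0.15 * acceptance_rate

def rank_offers(lat: float, lng: float, driver_stats: dict) -> list[str]:
    """driver_stats maps driverId -> (rating, acceptance_rate); unknown drivers get a default."""
    candidates = find_candidates(lat, lng)
    ranked = sorted(
        candidates,
        key=lambda c: score(c[1], *driver_stats.get(c[0], (4.5, 0.8))),
        reverse=True,
    )
    return [driver_id for driver_id, _dist in ranked]
```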
7. Bottlenecks & Mitigations
- Hot cells: Stadium events spike one geohash—shard matching by cell + overflow queues; cap concurrent offers per driver to prevent starvation elsewhere.
- Thundering herd on surge: Precompute multipliers; use jittered TTLs on the ETA cache; coalesce requests for identical OD pairs (a sketch follows this list).
- Celebrity driver (rare): Rate-limit visibility or rotate candidates.
- Head-of-line blocking in matching: Partition matching queues per metro; backpressure on trip creation if worker lag exceeds SLO.
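A sketch of the jittered-TTL and request-coalescing idea for the ETA cache; the in-process dict stands in for Redis, and the TTL bounds and wait timeout are assumptions.

```python
import random
import threading
import time

_eta_cache: dict[str, tuple[float, float]] = {}   # od_key -> (eta_seconds, expires_at)
_inflight: dict[str, threading.Event] = {}        # one in-flight compute per OD pair
_lock = threading.Lock()

def _jittered_ttl(base_s: float = 60.0, jitter_s: float = 15.0) -> float:
    # Spread expirations so a popular corridor does not expire everywhere at once.
    return base_s + random.uniform(0, jitter_s)

def get_eta(od_key: str, compute) -> float:
    now = time.time()
    with _lock:
        hit = _eta_cache.get(od_key)
        if hit and hit[1] > now:
            return hit[0]                          # fresh cache hit
        waiter = _inflight.get(od_key)
        if waiter is None:
            _inflight[od_key] = threading.Event()  # this caller owns the recompute

    if waiter is not None:                         # coalesce: ride on the in-flight compute
        waiter.wait(timeout=2.0)
        hit = _eta_cache.get(od_key)
        return hit[0] if hit else compute(od_key)  # fall back to our own call on timeout

    try:                                           # owner path: compute once, cache with jitter
        value = compute(od_key)
        with _lock:
            _eta_cache[od_key] = (value, now + _jittered_ttl())
        return value
    finally:
        with _lock:
            _inflight.pop(od_key).set()            # wake any coalesced waiters
```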
8. Tradeoffs
| Decision | Alternative | Why we picked it |
|---|---|---|
| Redis GEO + Kafka | Direct Postgres for locations | Write volume and radius queries need in-memory geo |
| Partner maps API | Build maps in-house | Time-to-market and global coverage |
| Push offers to drivers | Long polling only | Latency for acceptance UX |
| Strong trip state in SQL | Event sourcing everywhere | Simpler correctness for payments bridge |
9. Follow-ups (interviewer drill-downs)
- What if matching QPS grows 100×? Horizontal workers per region; partition Kafka by regionId; consider approximate geo indexes.
- Exactly-once location? End-to-end exactly-once is impossible; use idempotent ingest keys and monotonic timestamps (idempotency); see the sketch after this list.
- Data model migration? Dual-write new trip fields; shadow-read validation; cut over per region.
- Multi-region active-active? Keep trip authority single-region per trip; cross-region read replicas serve analytics.
- Cost? Tiered storage for history; compress location archives; narrow partner map API calls via caching; right-size Redis clusters per metro after measuring the observed cardinality of active drivers per cell, not global DAU alone.
- Fraud / safety? Combine device attestation, route plausibility checks, synthetic-GPS heuristics, and risk scores for new accounts; shadow high-risk match outcomes before money movement. If you add a regional orchestrator for HA, use a small Raft group or a lease-based primary as in replication patterns, not ad-hoc leader guesswork.
- Observability budgets? Trace sampling must rise automatically during incidents; set per-region SLO burn alerts so on-call does not drown in 100% trace volume forever.
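For the idempotent location ingest mentioned above, one minimal approach is a short-lived dedupe key per (driverId, sequence) plus a monotonic-timestamp guard; the key names and TTLs here are assumptions.

```python
import redis

r = redis.Redis(decode_responses=True)

def ingest_location(ping: dict) -> bool:
    """Accept at-least-once delivery but apply each (driverId, sequence) at most once."""
    dedupe_key = f"loc:seen:{ping['driverId']}:{ping['sequence']}"
    # SET NX EX: only the first delivery of this sequence number wins.
    if not r.set(dedupe_key, 1, nx=True, ex=600):
        return False                                # duplicate delivery, already applied

    # Monotonic-timestamp guard: never apply a ping older than the latest applied one.
    latest_key = f"loc:latest_ts:{ping['driverId']}"
    latest = r.get(latest_key)
    if latest is not None and float(latest) >= ping["timestamp"]:
        return False
    r.set(latest_key, ping["timestamp"], ex=600)

    # Apply the update here (e.g., GEOADD into the per-region index).
    return True
```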