THN Interview Prep

Design Uber (Ride-Hailing Platform)

1. Requirements

Functional

  • Riders can request rides by pickup and drop-off locations; see ETA, fare estimate, and driver assignment.
  • Drivers go online/offline, accept or decline offers, navigate to pickup and drop-off, complete trips.
  • Live trip tracking for rider and support; surge pricing when demand exceeds supply.
  • Payments after trip completion; ratings and dispute hooks (payment detail out of scope here).
  • Support for multiple vehicle categories and regional rules.

Non-Functional

  • Scale: tens of millions of DAU globally; peak write QPS for location updates in the hundreds of thousands; matching path in the low tens of thousands of QPS per metro.
  • Latency: driver discovery and match offer under ~2–5 s p99 in dense cities; ETA read p99 under ~500 ms when cached.
  • Availability: core dispatch and safety-critical paths target 99.99%; maps and estimates may degrade gracefully.
  • Consistency: strong consistency for trip state machine and payments handoff; eventual for ETA caches and driver lists near boundaries.
  • Durability: trip records and audit logs durable for years; location streams retained per policy.

Out of Scope

  • Full in-house maps and routing (assume integration with third-party mapping).
  • Tax, regulatory filings, and insurance product design.
  • In-car hardware and telematics beyond mobile SDK assumptions.

2. Back-of-Envelope Estimations

Assume 50M DAU, 5 rides per active user per month on average (mixed worldwide).

  • Trips per day: 50M × 5 / 30 ≈ 8.3M trips/day.

  • Peak factor ~3× the daily average in metros: 8.3M trips/day ≈ 96 QPS of new trip requests on average, so roughly 290 QPS globally at peak (~1M requests/hour). Location pings dominate: if 2M drivers update every 4 s, that is ≈ 500k location updates/s globally (sharded by region).

  • Storage: ~10 KB metadata per trip + ~1 KB audit/events → ~100 GB/day of raw trip data; 5 years ≈ 180 TB before compression and tiering; location history is often sampled or kept on a short TTL.

  • Bandwidth: location streams are mostly internal; clients fetch map tiles from a CDN. Assume ~200 MB/day per 1M active drivers of compressed location deltas, internal-only, at edge aggregation.

  • Cache: hot ETAs and surge multipliers; an 80/20 split on popular corridors suggests ~20% of city cells drive ~80% of traffic; size the ETA cache to cover that working set per region (often single-digit GB of Redis per metro).

For reasoning discipline on orders of magnitude, cross-check with the back-of-envelope notes.

Safety & fraud (rough capacity): ~1% of sessions may trigger risk checks; budget ~1–3k QPS peak for synchronous scoring calls to a feature service backed by offline models. Async enrichment for lower-risk cohorts keeps the match path fast.

Driver churn: if the average driver session is 4 hours online and there are 500k concurrent drivers at global peak, expect ~125k session starts (and as many ends) per hour from lifecycle alone, not counting forced logouts. This informs auth and presence cluster sizing separately from trip matching.
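A quick sanity check of the arithmetic above, as a plain-Python sketch (all inputs are the assumptions already stated, not measurements):

```python
# Back-of-envelope sanity check for the estimates in this section.
DAU = 50e6                       # daily active users (assumption)
RIDES_PER_USER_MONTH = 5

trips_per_day = DAU * RIDES_PER_USER_MONTH / 30
print(f"trips/day         ≈ {trips_per_day/1e6:.1f}M")                 # ≈ 8.3M

avg_qps = trips_per_day / 86_400
print(f"new-trip QPS      ≈ {avg_qps:.0f} avg, {3*avg_qps:.0f} peak")  # ≈ 96 / 289

drivers, ping_interval_s = 2e6, 4
print(f"location writes/s ≈ {drivers/ping_interval_s/1e3:.0f}k")       # ≈ 500k

bytes_per_trip = 11_000          # ~10 KB metadata + ~1 KB audit/events
gb_per_day = trips_per_day * bytes_per_trip / 1e9
print(f"trip storage      ≈ {gb_per_day:.0f} GB/day, "
      f"{gb_per_day*365*5/1e3:.0f} TB over 5 years")
# ≈ 92 GB/day and 167 TB; the text rounds up to ~100 GB/day and ~180 TB.

concurrent_drivers, session_hours = 500e3, 4
print(f"session starts/hr ≈ {concurrent_drivers/session_hours/1e3:.0f}k")  # ≈ 125k
```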

3. API Design

POST /v1/trips
Body: { riderId, pickup: { lat, lng }, dropoff: { lat, lng }, productType }
-> 202 { tripId, status: "matching", estimatedFareRange }

PATCH /v1/trips/{tripId}/drivers/{driverId}/accept
-> 200 { tripId, status: "driver_assigned" }

POST /v1/drivers/{driverId}/location
Body: { lat, lng, heading, timestamp, accuracyMeters }
-> 204

GET /v1/trips/{tripId}
-> 200 { tripId, status, driverId?, route?, fare? }

GET /v1/eta?originLat=&originLng=&destLat=&destLng=
-> 200 { durationSeconds, distanceMeters, surgeMultiplier }

Errors: 409 state conflict, 429 rate limited (see rate limiter), 503 region overloaded.

DELETE /v1/trips/{tripId}
Body: { reasonCode }
-> 204

GET /v1/drivers/nearby?lat=&lng=&radiusMeters=&limit=
-> 200 { drivers: [{ driverId, etaSeconds, rating }] }
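A minimal client-side sketch of the request-then-match flow against these endpoints (Python with the `requests` library; the base URL, auth header, and polling cadence are illustrative assumptions, and a real client would receive the assignment via push rather than polling):

```python
import time
import requests

BASE = "https://api.example.com/v1"            # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}  # auth scheme assumed

# Request a ride; the API returns 202 and matches asynchronously.
resp = requests.post(f"{BASE}/trips", headers=HEADERS, json={
    "riderId": "rider_42",
    "pickup":  {"lat": 37.7749, "lng": -122.4194},
    "dropoff": {"lat": 37.7849, "lng": -122.4094},
    "productType": "standard",
})
resp.raise_for_status()
trip_id = resp.json()["tripId"]

# Poll trip status until it leaves "matching" (push/WebSocket in production).
while True:
    trip = requests.get(f"{BASE}/trips/{trip_id}", headers=HEADERS).json()
    if trip["status"] != "matching":
        break
    time.sleep(2)
print(trip["status"], trip.get("driverId"))
```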

4. Data Model

  • Trip: tripId (ULID), riderId, driverId?, status (enum), pickup, dropoff, fareSnapshot, createdAt, updatedAt.
  • DriverSession: driverId, regionId, online, vehicleId, lastLocation, lastSeen.
  • SurgeCell: cellId (e.g. geohash prefix), multiplier, validUntil.

SQL (Postgres) for trip money and lifecycle (ACID). Redis for driver presence buckets and the ETA cache. DynamoDB or Cassandra is optional for high-volume location writes per region if Postgres cannot scale writes; tradeoffs follow CAP/PACELC.

Indexes: Trip(riderId, createdAt DESC), Trip(driverId, createdAt DESC), DriverSession(regionId, online).

Sample trip row: (tripId, rider_42, driver_7, en_route, …).
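One way to enforce single-writer trip transitions on this model is a compare-and-swap update in Postgres, sketched below with `psycopg2`. The `version` column and the allowed-transition set are assumptions added for illustration, not part of the schema above:

```python
import psycopg2

# Illustrative subset of legal trip state transitions.
ALLOWED = {
    ("matching", "driver_assigned"),
    ("driver_assigned", "en_route"),
    ("en_route", "completed"),
}

def transition(conn, trip_id: str, from_status: str, to_status: str) -> bool:
    """Atomically move a trip between states; returns False on a lost race."""
    if (from_status, to_status) not in ALLOWED:
        raise ValueError(f"illegal transition {from_status} -> {to_status}")
    with conn, conn.cursor() as cur:  # connection context commits or rolls back
        cur.execute(
            """UPDATE trips
               SET status = %s, version = version + 1, updated_at = now()
               WHERE trip_id = %s AND status = %s""",
            (to_status, trip_id, from_status),
        )
        # rowcount == 0 means another writer won the race; surface 409 upstream.
        return cur.rowcount == 1
```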

5. High-Level Architecture


Clients hit CDN for static and config; API gateway terminates TLS and applies auth. Trip service owns the trip state machine. Location ingress batches to Kafka; consumers update Redis geospatial indexes. ETA service combines cached segments with partner map APIs. See load balancing and message queues.

6. Component Deep-Dives

  • Matching worker: Pull candidate drivers from Redis GEO within a radius; rank by distance, rating, and acceptance history. Use incremental reranking on location updates. Algorithm: bounded-radius nearest-neighbor with scoring, not pure brute force over all drivers (see the sketch after this list).

  • Surge: Aggregate demand/supply per geohash cell (see geohash / quadtree); update multipliers on intervals to avoid thrash.

  • Trip state machine: Single writer per tripId; optimistic locking or row locks in Postgres for transitions.

  • Location pipeline: At-least-once from mobile; server dedupes by (driverId, sequence); stale updates dropped by timestamp.

  • Failure modes: Kafka lag → fall back to wider match radius with higher latency cap; Redis outage → degrade to DB-backed coarse regions (worse UX).

  • Observability: OpenTelemetry traces across match → offer → accept with baggage for tripId; SLO dashboards on p99 offer latency and empty-pool rate. Chaos tests for Redis partition validate graceful degradation without 5xx storms.

  • Regulatory data: location and trip history subject to retention and deletion requests—cold storage for completed trips with PII minimization on analytics exports.
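A sketch of the matching worker's candidate pull and scoring, using redis-py (≥ 4.0, against Redis ≥ 6.2 for GEOSEARCH). The key naming scheme and the scoring weights are assumptions; production scoring would also fold in acceptance history and routed ETA:

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a reachable Redis >= 6.2

def update_presence(region: str, driver_id: str, lat: float, lng: float) -> None:
    # GEOADD keeps the driver's latest position in the region's geo index.
    r.geoadd(f"drivers:{region}", (lng, lat, driver_id))

def rank_candidates(region: str, lat: float, lng: float,
                    ratings: dict[str, float],
                    radius_m: int = 3000, k: int = 8):
    """Bounded-radius nearest-neighbor with scoring, not global brute force."""
    # Candidate pull: drivers inside the radius, nearest first, capped at 50.
    found = r.geosearch(f"drivers:{region}", longitude=lng, latitude=lat,
                        radius=radius_m, unit="m", sort="ASC",
                        count=50, withdist=True)

    def score(entry) -> float:
        driver_id, dist_m = entry[0], float(entry[1])
        # Weighted blend of proximity and rating; weights are illustrative.
        return 0.7 * (1 - dist_m / radius_m) + 0.3 * (ratings.get(driver_id, 4.5) / 5)

    return sorted(found, key=score, reverse=True)[:k]
```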

7. Bottlenecks & Mitigations

  • Hot cells: Stadium events spike a single geohash; shard matching by cell with overflow queues; cap concurrent offers per driver to prevent starvation elsewhere.
  • Thundering herd on surge: Precompute multipliers; use jittered TTLs on the ETA cache; coalesce requests for identical origin–destination (OD) pairs (sketched after this list).
  • Celebrity driver (rare): Rate-limit visibility or rotate candidates.
  • Head-of-line blocking in matching: Partition matching queues per metro; backpressure on trip creation if worker lag exceeds SLO.
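A minimal sketch of the jittered-TTL and request-coalescing ideas above (Python asyncio; `fetch` stands in for whatever computes the ETA upstream):

```python
import asyncio
import random

def jittered_ttl(base_seconds: int, jitter_frac: float = 0.2) -> float:
    """Spread cache expiries so one surge tick can't expire a whole metro at once."""
    return base_seconds * (1 + random.uniform(-jitter_frac, jitter_frac))

_inflight: dict[str, asyncio.Task] = {}

async def coalesced(key: str, fetch):
    """Collapse concurrent identical OD-pair lookups into one upstream call."""
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(fetch())
        _inflight[key] = task
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    return await task
```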

8. Tradeoffs

| Decision | Alternative | Why we picked it |
| --- | --- | --- |
| Redis GEO + Kafka | Direct Postgres for locations | Write volume and radius queries need in-memory geo |
| Partner maps API | Build maps in-house | Time-to-market and global coverage |
| Push offers to drivers | Long polling only | Latency for acceptance UX |
| Strong trip state in SQL | Event sourcing everywhere | Simpler correctness for payments bridge |

9. Follow-ups (interviewer drill-downs)

  • What if matching QPS grows 100×? Horizontal workers per region; partition Kafka by regionId; consider approximate geo indexes.

  • Exactly-once location? End-to-end exactly-once is impossible; use idempotent ingest keys and monotonic timestamps (idempotency; sketch after this list).

  • Data model migration? Dual-write new trip fields; shadow read validation; cutover per region.

  • Multi-region active-active? Trip authority single-region per trip; cross-region read replicas for analytics.

  • Cost? Tiered storage for history; compress location archives; narrow partner map API calls via caching; right-size Redis clusters per metro after measuring the observed cardinality of active drivers per cell, not global DAU alone.

  • Fraud / safety? Combine device attestation, route plausibility checks, synthetic GPS heuristics, and risk scores for new accounts; shadow high-risk match outcomes before money movement. If you add a regional orchestrator for HA, use a small raft group or lease-based primary as in replication patterns, not ad-hoc leader guesswork.

  • Observability budgets? Trace sampling should rise automatically during incidents; pair it with per-region SLO burn-rate alerts so on-call does not drown in 100% trace volume forever.
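On the exactly-once question, a minimal sketch of idempotent ingest: dedupe by (driverId, sequence) and drop stale updates by timestamp, matching the location-pipeline deep-dive. The in-memory dict is a stand-in; a real consumer would keep this state in Redis or a per-partition local store:

```python
# driverId -> (last applied sequence, last applied timestamp)
last_seen: dict[str, tuple[int, float]] = {}

def ingest(update: dict) -> bool:
    """Return True if the update should be applied, False if duplicate/stale."""
    driver = update["driverId"]
    seq, ts = update["sequence"], update["timestamp"]
    prev = last_seen.get(driver)
    if prev is not None and (seq <= prev[0] or ts <= prev[1]):
        return False            # replayed or out-of-order ping; drop it
    last_seen[driver] = (seq, ts)
    return True
```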
