THN Interview Prep

Design Uber (Ride-Hailing Platform)

1. Requirements

Functional

  • Riders can request rides by pickup and drop-off locations; see ETA, fare estimate, and driver assignment.
  • Drivers go online/offline, accept or decline offers, navigate to pickup and drop-off, complete trips.
  • Live trip tracking for rider and support; surge pricing when demand exceeds supply.
  • Payments after trip completion; ratings and dispute hooks (payment detail out of scope here).
  • Support for multiple vehicle categories and regional rules.

Non-Functional

  • Scale: tens of millions of DAU globally; peak write QPS for location updates in the hundreds of thousands; matching path in the low tens of thousands of QPS per metro.
  • Latency: driver discovery and match offer under ~2–5 s p99 in dense cities; ETA read p99 under ~500 ms when cached.
  • Availability: core dispatch and safety-critical paths target 99.99%; maps and estimates may degrade gracefully.
  • Consistency: strong consistency for trip state machine and payments handoff; eventual for ETA caches and driver lists near boundaries.
  • Durability: trip records and audit logs durable for years; location streams retained per policy.

Out of Scope

  • Full in-house maps and routing (assume integration with third-party mapping).
  • Tax, regulatory filings, and insurance product design.
  • In-car hardware and telematics beyond mobile SDK assumptions.

2. Back-of-Envelope Estimations

Assume 50M DAU, 5 rides per active user per month on average (mixed worldwide).

  • Trips per day: 50M × 5 / 30 ≈ 8.3M trips/day.

  • Peak factor ~3× the daily average in metros: 8.3M trips/day ≈ 96 QPS of new trip requests on average, so roughly 290 QPS globally at peak (~1M requests/hour). Location pings dominate: if 2M drivers update every 4 s, that is ≈ 500k location updates/s globally (sharded by region).

  • Storage: ~10 KB metadata per trip + ~1 KB audit/events → ~100 GB/day of raw trip data; 5 years ≈ 180 TB before compression and tiering; location history is often sampled or kept on a short TTL.

  • Bandwidth: location streams are mostly internal; clients fetch map tiles from a CDN. Assume ~200 MB/day per 1M active drivers of compressed location deltas, internal-only, at edge aggregation.

  • Cache: hot ETAs and surge multipliers; an 80/20 split on popular corridors suggests ~20% of city cells drive ~80% of traffic; size the ETA cache to cover that working set per region (often single-digit GB of Redis per metro).

For reasoning discipline on orders of magnitude, cross-check with the back-of-envelope notes.

Safety & fraud (rough capacity): ~1% of sessions may trigger risk checks; budget ~1–3k QPS peak for synchronous scoring calls to a feature service backed by offline models. Async enrichment for lower-risk cohorts keeps the match path fast.

Driver churn: if the average driver session is 4 hours online and there are 500k concurrent drivers at global peak, expect ~125k session starts (and as many ends) per hour from lifecycle alone, not counting forced logouts. This informs auth and presence cluster sizing separately from trip matching.
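A quick sanity check of the arithmetic above, as a plain-Python sketch (all inputs are the assumptions already stated, not measurements):

```python
# Back-of-envelope sanity check for the estimates in this section.
DAU = 50e6                       # daily active users (assumption)
RIDES_PER_USER_MONTH = 5

trips_per_day = DAU * RIDES_PER_USER_MONTH / 30
print(f"trips/day         ≈ {trips_per_day/1e6:.1f}M")                 # ≈ 8.3M

avg_qps = trips_per_day / 86_400
print(f"new-trip QPS      ≈ {avg_qps:.0f} avg, {3*avg_qps:.0f} peak")  # ≈ 96 / 289

drivers, ping_interval_s = 2e6, 4
print(f"location writes/s ≈ {drivers/ping_interval_s/1e3:.0f}k")       # ≈ 500k

bytes_per_trip = 11_000          # ~10 KB metadata + ~1 KB audit/events
gb_per_day = trips_per_day * bytes_per_trip / 1e9
print(f"trip storage      ≈ {gb_per_day:.0f} GB/day, "
      f"{gb_per_day*365*5/1e3:.0f} TB over 5 years")
# ≈ 92 GB/day and 167 TB; the text rounds up to ~100 GB/day and ~180 TB.

concurrent_drivers, session_hours = 500e3, 4
print(f"session starts/hr ≈ {concurrent_drivers/session_hours/1e3:.0f}k")  # ≈ 125k
```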

3. API Design

POST /v1/trips
Body: { riderId, pickup: { lat, lng }, dropoff: { lat, lng }, productType }
-> 202 { tripId, status: "matching", estimatedFareRange }

PATCH /v1/trips/{tripId}/drivers/{driverId}/accept
-> 200 { tripId, status: "driver_assigned" }

POST /v1/drivers/{driverId}/location
Body: { lat, lng, heading, timestamp, accuracyMeters }
-> 204

GET /v1/trips/{tripId}
-> 200 { tripId, status, driverId?, route?, fare? }

GET /v1/eta?originLat=&originLng=&destLat=&destLng=
-> 200 { durationSeconds, distanceMeters, surgeMultiplier }

Errors: 409 state conflict, 429 rate limited (see rate limiter), 503 region overloaded.

DELETE /v1/trips/{tripId}
Body: { reasonCode }
-> 204

GET /v1/drivers/nearby?lat=&lng=&radiusMeters=&limit=
-> 200 { drivers: [{ driverId, etaSeconds, rating }] }
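A minimal client-side sketch of the request-then-match flow against these endpoints (Python with the `requests` library; the base URL, auth header, and polling cadence are illustrative assumptions, and a real client would receive the assignment via push rather than polling):

```python
import time
import requests

BASE = "https://api.example.com/v1"            # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}  # auth scheme assumed

# Request a ride; the API returns 202 and matches asynchronously.
resp = requests.post(f"{BASE}/trips", headers=HEADERS, json={
    "riderId": "rider_42",
    "pickup":  {"lat": 37.7749, "lng": -122.4194},
    "dropoff": {"lat": 37.7849, "lng": -122.4094},
    "productType": "standard",
})
resp.raise_for_status()
trip_id = resp.json()["tripId"]

# Poll trip status until it leaves "matching" (push/WebSocket in production).
while True:
    trip = requests.get(f"{BASE}/trips/{trip_id}", headers=HEADERS).json()
    if trip["status"] != "matching":
        break
    time.sleep(2)
print(trip["status"], trip.get("driverId"))
```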

4. Data Model

  • Trip: tripId (ULID), riderId, driverId?, status (enum), pickup, dropoff, fareSnapshot, createdAt, updatedAt.
  • DriverSession: driverId, regionId, online, vehicleId, lastLocation, lastSeen.
  • SurgeCell: cellId (e.g. geohash prefix), multiplier, validUntil.

SQL (Postgres) for trip money and lifecycle (ACID). Redis for driver presence buckets and the ETA cache. DynamoDB or Cassandra is optional for high-volume location writes per region if Postgres cannot scale writes; tradeoffs follow CAP/PACELC.

Indexes: Trip(riderId, createdAt DESC), Trip(driverId, createdAt DESC), DriverSession(regionId, online).

Sample trip row: (tripId, rider_42, driver_7, en_route, …).
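One way to enforce single-writer trip transitions on this model is a compare-and-swap update in Postgres, sketched below with `psycopg2`. The `version` column and the allowed-transition set are assumptions added for illustration, not part of the schema above:

```python
import psycopg2

# Illustrative subset of legal trip state transitions.
ALLOWED = {
    ("matching", "driver_assigned"),
    ("driver_assigned", "en_route"),
    ("en_route", "completed"),
}

def transition(conn, trip_id: str, from_status: str, to_status: str) -> bool:
    """Atomically move a trip between states; returns False on a lost race."""
    if (from_status, to_status) not in ALLOWED:
        raise ValueError(f"illegal transition {from_status} -> {to_status}")
    with conn, conn.cursor() as cur:  # connection context commits or rolls back
        cur.execute(
            """UPDATE trips
               SET status = %s, version = version + 1, updated_at = now()
               WHERE trip_id = %s AND status = %s""",
            (to_status, trip_id, from_status),
        )
        # rowcount == 0 means another writer won the race; surface 409 upstream.
        return cur.rowcount == 1
```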

5. High-Level Architecture


Clients hit CDN for static and config; API gateway terminates TLS and applies auth. Trip service owns the trip state machine. Location ingress batches to Kafka; consumers update Redis geospatial indexes. ETA service combines cached segments with partner map APIs. See load balancing and message queues.

6. Component Deep-Dives

  • Matching worker: Pull candidate drivers from Redis GEO within a radius; rank by distance, rating, and acceptance history. Use incremental reranking on location updates. Algorithm: bounded-radius nearest-neighbor with scoring, not pure brute force over all drivers (see the sketch after this list).

  • Surge: Aggregate demand/supply per geohash cell (see geohash / quadtree); update multipliers on intervals to avoid thrash.

  • Trip state machine: Single writer per tripId; optimistic locking or row locks in Postgres for transitions.

  • Location pipeline: At-least-once from mobile; server dedupes by (driverId, sequence); stale updates dropped by timestamp.

  • Failure modes: Kafka lag → fall back to wider match radius with higher latency cap; Redis outage → degrade to DB-backed coarse regions (worse UX).

  • Observability: OpenTelemetry traces across match → offer → accept with baggage for tripId; SLO dashboards on p99 offer latency and empty-pool rate. Chaos tests for Redis partition validate graceful degradation without 5xx storms.

  • Regulatory data: location and trip history subject to retention and deletion requests—cold storage for completed trips with PII minimization on analytics exports.
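A sketch of the matching worker's candidate pull and scoring, using redis-py (≥ 4.0, against Redis ≥ 6.2 for GEOSEARCH). The key naming scheme and the scoring weights are assumptions; production scoring would also fold in acceptance history and routed ETA:

```python
import redis

r = redis.Redis(decode_responses=True)  # assumes a reachable Redis >= 6.2

def update_presence(region: str, driver_id: str, lat: float, lng: float) -> None:
    # GEOADD keeps the driver's latest position in the region's geo index.
    r.geoadd(f"drivers:{region}", (lng, lat, driver_id))

def rank_candidates(region: str, lat: float, lng: float,
                    ratings: dict[str, float],
                    radius_m: int = 3000, k: int = 8):
    """Bounded-radius nearest-neighbor with scoring, not global brute force."""
    # Candidate pull: drivers inside the radius, nearest first, capped at 50.
    found = r.geosearch(f"drivers:{region}", longitude=lng, latitude=lat,
                        radius=radius_m, unit="m", sort="ASC",
                        count=50, withdist=True)

    def score(entry) -> float:
        driver_id, dist_m = entry[0], float(entry[1])
        # Weighted blend of proximity and rating; weights are illustrative.
        return 0.7 * (1 - dist_m / radius_m) + 0.3 * (ratings.get(driver_id, 4.5) / 5)

    return sorted(found, key=score, reverse=True)[:k]
```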

7. Bottlenecks & Mitigations

  • Hot cells: Stadium events spike a single geohash; shard matching by cell with overflow queues; cap concurrent offers per driver to prevent starvation elsewhere.
  • Thundering herd on surge: Precompute multipliers; use jittered TTLs on the ETA cache; coalesce requests for identical origin–destination (OD) pairs (sketched after this list).
  • Celebrity driver (rare): Rate-limit visibility or rotate candidates.
  • Head-of-line blocking in matching: Partition matching queues per metro; backpressure on trip creation if worker lag exceeds SLO.
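A minimal sketch of the jittered-TTL and request-coalescing ideas above (Python asyncio; `fetch` stands in for whatever computes the ETA upstream):

```python
import asyncio
import random

def jittered_ttl(base_seconds: int, jitter_frac: float = 0.2) -> float:
    """Spread cache expiries so one surge tick can't expire a whole metro at once."""
    return base_seconds * (1 + random.uniform(-jitter_frac, jitter_frac))

_inflight: dict[str, asyncio.Task] = {}

async def coalesced(key: str, fetch):
    """Collapse concurrent identical OD-pair lookups into one upstream call."""
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(fetch())
        _inflight[key] = task
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    return await task
```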

8. Tradeoffs

| Decision | Alternative | Why we picked it |
| --- | --- | --- |
| Redis GEO + Kafka | Direct Postgres for locations | Write volume and radius queries need in-memory geo |
| Partner maps API | Build maps in-house | Time-to-market and global coverage |
| Push offers to drivers | Long polling only | Latency for acceptance UX |
| Strong trip state in SQL | Event sourcing everywhere | Simpler correctness for payments bridge |

9. Follow-ups (interviewer drill-downs)

  • What if matching QPS grows 100×? Horizontal workers per region; partition Kafka by regionId; consider approximate geo indexes.

  • Exactly-once location? End-to-end exactly-once is impossible; use idempotent ingest keys and monotonic timestamps (idempotency; sketch after this list).

  • Data model migration? Dual-write new trip fields; shadow read validation; cutover per region.

  • Multi-region active-active? Trip authority single-region per trip; cross-region read replicas for analytics.

  • Cost? Tiered storage for history; compress location archives; narrow partner map API calls via caching; right-size Redis clusters per metro after measuring the observed cardinality of active drivers per cell, not global DAU alone.

  • Fraud / safety? Combine device attestation, route plausibility checks, synthetic GPS heuristics, and risk scores for new accounts; shadow high-risk match outcomes before money movement. If you add a regional orchestrator for HA, use a small raft group or lease-based primary as in replication patterns, not ad-hoc leader guesswork.

  • Observability budgets? Trace sampling should rise automatically during incidents; pair it with per-region SLO burn-rate alerts so on-call does not drown in 100% trace volume forever.
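On the exactly-once question, a minimal sketch of idempotent ingest: dedupe by (driverId, sequence) and drop stale updates by timestamp, matching the location-pipeline deep-dive. The in-memory dict is a stand-in; a real consumer would keep this state in Redis or a per-partition local store:

```python
# driverId -> (last applied sequence, last applied timestamp)
last_seen: dict[str, tuple[int, float]] = {}

def ingest(update: dict) -> bool:
    """Return True if the update should be applied, False if duplicate/stale."""
    driver = update["driverId"]
    seq, ts = update["sequence"], update["timestamp"]
    prev = last_seen.get(driver)
    if prev is not None and (seq <= prev[0] or ts <= prev[1]):
        return False            # replayed or out-of-order ping; drop it
    last_seen[driver] = (seq, ts)
    return True
```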
