THN Interview Prep

Design URL Shortener

1. Requirements

Functional

  • Create a short alias for a long HTTPS URL; returned alias is unique and resolves to the original target.
  • Resolve short codes via HTTP redirect (301 for SEO-stable links, 302 when analytics need every hit).
  • Optional custom aliases when allowed by policy (length, charset, reserved words).
  • Optional expiration time and soft-delete for abuse or user-initiated removal.
  • Optional click analytics: aggregate counts per short code, coarse geo/device bucketing.
  • Admin-style APIs for lookup-by-owner and revocation (out of scope for public MVP unless product requires it).

Non-Functional

  • Scale: 100M DAU as a global consumer-product interview framing; heavily read-skewed, on the order of 100:1 or more reads to writes for a shortener (the estimate in section 2 lands near 200:1).
  • Writes: ~120 short URLs/s sustained at that scale (bursting 5–10x during campaigns); reads dominated by a small set of popular links.
  • Latency: resolve p99 under 50 ms at edge excluding client RTT; create p99 under 200 ms including DB round trips.
  • Availability: 99.99% for reads; 99.9% for creates acceptable if rare failures retry.
  • Consistency: strong uniqueness on create (each code maps to exactly one target; custom aliases are first-writer-wins); redirect reads tolerate eventual consistency when the CDN caches a 301; analytics are eventual.
  • Durability: long-term persistence of mapping; audit trail optional.

Out of Scope

  • Full malware/phishing classification pipeline (assume integration with third-party safe-browsing APIs only at edge).
  • Billing, subscriptions, and organization SSO beyond a stub identity layer.
  • Arbitrary URL shortening for non-HTTP(S) schemes beyond validation rules.
  • Graph analytics across redirect graphs.

2. Back-of-Envelope Estimations

Assume 100M DAU, each user creates 0.1 short URLs per day on average (10M new shorts/day) and follows 20 redirects/day average (2B reads/day).

  • Write QPS: 10M / 86,400 s ~ 116/s sustained; peak 5–10x ~ 600–1,200/s.
  • Read QPS: 2B / 86,400 ~ 23,100/s sustained; peak ~ 100k/s with viral links.

Storage for mappings (code, long URL, metadata ~500 bytes per row): 10M * 500 B ~ 5 GB/day; ~1.8 TB/year; 5 years ~ 9 TB before compaction of expired rows.

Bandwidth: average redirect response ~300 B body + headers ~ 1 KB; 2B * 1 KB ~ 2 TB/day egress from origin before CDN (CDN absorbs most).

Cache working set: the top 1% of codes may drive ~80% of reads (Pareto). Even if 1B distinct codes exist over time, a hot set of ~10M codes * 500 B ~ 5 GB fits in a regional Redis tier, with CDN edge keys absorbing the very hottest.

Show math explicitly in interviews: always relate DAU to actions, then to QPS, then to fan-out and storage.
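The DAU-to-QPS-to-storage chain above, as a quick sanity-check script:

```python
# Back-of-envelope estimates from the stated assumptions:
# 100M DAU, 0.1 creates/user/day, 20 redirects/user/day, ~500 B per row.
DAU = 100_000_000
CREATES_PER_USER = 0.1
REDIRECTS_PER_USER = 20
ROW_BYTES = 500
SECONDS_PER_DAY = 86_400

writes_per_day = DAU * CREATES_PER_USER          # 10M new shorts/day
reads_per_day = DAU * REDIRECTS_PER_USER         # 2B redirects/day

write_qps = writes_per_day / SECONDS_PER_DAY     # ~116/s sustained
read_qps = reads_per_day / SECONDS_PER_DAY       # ~23,100/s sustained

storage_per_day_gb = writes_per_day * ROW_BYTES / 1e9   # ~5 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1000   # ~1.8 TB/year
storage_5y_tb = storage_per_year_tb * 5                 # ~9 TB over 5 years
```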


3. API Design

REST over HTTPS; JSON bodies; idempotency key header on create for safe retries.

POST /v1/short-urls
Headers: Idempotency-Key: <uuid>
Body: { "targetUrl": "https://example.com/a/b?c=1", "ttlSeconds": 2592000, "customAlias": "spring-sale" }
-> 201 { "shortCode": "aB3dEf", "shortUrl": "https://go.example/aB3dEf", "expiresAt": "2026-05-30T12:00:00Z" }
-> 409 { "error": "alias_taken" }
-> 422 { "error": "invalid_target_url" }
GET /v1/short-urls/{shortCode}
-> 302 Location: <targetUrl>  (or 301 if configured immutable)
-> 404 { "error": "not_found" }
-> 410 { "error": "expired" }
GET /v1/short-urls/{shortCode}/stats?granularity=day
-> 200 { "clicks": [ { "date": "2026-04-29", "count": 12034 } ] }

Internal gRPC (optional) for high-volume create from trusted services: CreateShortUrl, ResolveShortUrl with protobuf for lower overhead than JSON.
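One way the 422/409 checks behind POST /v1/short-urls might look, as a sketch (the length limits and the reserved-word list are illustrative assumptions, not part of the spec above):

```python
from typing import Optional
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
RESERVED_ALIASES = {"admin", "stats", "health"}  # illustrative policy list

def validate_create(target_url: str, custom_alias: Optional[str] = None) -> Optional[str]:
    """Return an error code matching the API's 4xx bodies, or None if valid."""
    parsed = urlparse(target_url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
        return "invalid_target_url"
    if custom_alias is not None:
        # Policy: 4-32 chars, alphanumeric plus hyphens, not reserved.
        if not (4 <= len(custom_alias) <= 32) or not custom_alias.replace("-", "").isalnum():
            return "invalid_alias"
        if custom_alias in RESERVED_ALIASES:
            return "alias_reserved"
    return None
```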


4. Data Model

Entities

  • short_url: short_code (PK, 6–8 char base62), target_url (text), owner_id (nullable), created_at, expires_at, redirect_type (301/302), idempotency_key unique.
  • click_aggregate: short_code, bucket_start, count — for analytics rollups.

SQL vs NoSQL

  • PostgreSQL (with Citus or manual sharding) fits strong uniqueness, transactional creates, and secondary indexes on owner_id and expires_at for cleanup jobs. Good when team expertise is relational and traffic fits vertically scaled writer with read replicas.
  • Amazon DynamoDB (or Cassandra) fits truly horizontal write scaling and predictable partition-level performance; use short_code as partition key for O(1) point lookups; GSI on idempotency_key for deduplication. Pick DynamoDB when managed ops and multi-region active-active matter more than rich SQL.

Indexes

  • Primary: short_code.
  • Unique: idempotency_key when present.
  • Optional GSI/BTREE on expires_at for TTL sweeper workers.

Sample row (DynamoDB item)

short_code: k8mP2x
target_url: https://shop.example/p/abc
expires_at: 2026-12-01T00:00:00Z
redirect_type: 301

5. High-Level Architecture

Clients hit a global anycast edge (CDN) for GET redirects; CDN misses go to regional API + cache. Creates go through API gateway to write path. Analytics buffered to a stream. See also caching and load balancing patterns.


6. Component Deep-Dives

Short code generation (counter vs random)

  • A segmented counter in a dedicated service (Snowflake-style epoch + worker ID + sequence), encoded base62, yields dense, fixed-length codes with no collision retries; it requires coordinated workers (ID generation discipline).
  • A crypto-random token (e.g. 80 bits, base62-encoded) still needs a uniqueness check and occasional retry; operationally simpler, but codes are longer, variable-length, and harder to type.

Pick counter-backed encoding when you want fixed length and monotonic debuggability; pick random when decentralizing generation at edge without coordination.
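A minimal base62 codec for counter-allocated IDs; 6 characters cover 62^6 ≈ 56.8B codes, and padding keeps the length fixed:

```python
import string

# 0-9, A-Z, a-z — the 62-character alphabet.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_base62(n: int, min_len: int = 6) -> str:
    """Encode a non-negative counter value as a fixed-minimum-length base62 code."""
    if n == 0:
        return BASE62[0] * min_len
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(BASE62[r])
    return "".join(reversed(out)).rjust(min_len, BASE62[0])

def decode_base62(code: str) -> int:
    """Inverse of encode_base62 (padding zeros are harmless)."""
    n = 0
    for ch in code:
        n = n * 62 + BASE62.index(ch)
    return n
```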

Resolve read path

  1. CDN checks edge cache key /{code}; on hit return 301/302 without origin.
  2. On miss, resolve service checks Redis (caching) for code -> target_url + flags.
  3. On Redis miss, load from DynamoDB/Postgres, backfill Redis with TTL, return redirect.
  4. Async publish click event to Kafka with short code, timestamp, anonymized IP bucket — never block redirect on analytics.
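Steps 2–3 above are cache-aside with backfill; a sketch with plain dicts standing in for Redis and the mapping store:

```python
import time
from typing import Optional

CACHE_TTL_S = 3600  # backfill TTL for warmed entries

class ResolveService:
    """Cache-aside resolve. `cache` and `db` are dict stand-ins for
    Redis and DynamoDB/Postgres; cache values are (target_url, expiry)."""

    def __init__(self, cache: dict, db: dict):
        self.cache = cache
        self.db = db

    def resolve(self, code: str) -> Optional[str]:
        hit = self.cache.get(code)
        if hit is not None and hit[1] > time.time():
            return hit[0]                         # step 2: cache hit
        target = self.db.get(code)                # step 3: point lookup by code
        if target is not None:
            self.cache[code] = (target, time.time() + CACHE_TTL_S)  # backfill
        return target                             # None -> caller returns 404
```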

Create write path

  1. Validate URL scheme/host policy.
  2. Idempotency: if Idempotency-Key seen, return prior result from store.
  3. Allocate code via counter service or insert with random code + retry on conflict.
  4. Write to primary store; replicate async to second region if multi-master pattern used.
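Steps 2–3 of the create path — idempotency replay plus random-code allocation with retry — as a minimal in-memory sketch (a real store would use a conditional put or a unique-constraint INSERT instead of the dict check):

```python
import secrets

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def random_code(length: int = 7) -> str:
    """Crypto-random base62 code."""
    return "".join(secrets.choice(BASE62) for _ in range(length))

def create_short_url(store: dict, seen_keys: dict, idem_key: str, target: str) -> str:
    # Step 2: replay the prior result for a repeated Idempotency-Key.
    if idem_key in seen_keys:
        return seen_keys[idem_key]
    # Step 3: allocate with retry on (unlikely) collision.
    for _ in range(5):
        code = random_code()
        if code not in store:  # real store: conditional write on short_code
            store[code] = target
            seen_keys[idem_key] = code
            return code
    raise RuntimeError("repeated collisions — alert and lengthen codes")
```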

Hot link mitigation

  • An extremely viral URL becomes a hot key pinned to one Redis shard; consistent hashing spreads keys but not a single hot key, so replicate read-heavy keys across shards (or micro-cache them locally in resolve pods) and coalesce concurrent requests for identical keys during spikes.
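Request coalescing for identical keys can be sketched as an in-process single-flight guard (illustrative; errors in `fn` are not handled here):

```python
import threading

class SingleFlight:
    """The first caller for a key does the work; concurrent callers for the
    same key wait and share the result instead of stampeding the backend."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event signalled when the result is ready
        self._results = {}    # key -> last computed result

    def do(self, key, fn):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                self._results[key] = fn()
            finally:
                ev.set()
                with self._lock:
                    del self._inflight[key]
        else:
            ev.wait()   # follower: block until the leader publishes
        return self._results[key]
```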

7. Bottlenecks & Mitigations

  • Hot short code — symptom: a single Redis shard saturated. Mitigation: local micro-cache in resolve pods; replicate read-heavy keys to multiple cache replicas; long CDN TTL on 301s.
  • Write leader limit — symptom: Postgres primary CPU-bound. Mitigation: shard by hash(short_code) or move writes to DynamoDB partitions.
  • Thundering herd on expiry — symptom: many clients refresh a huge CDN population at once. Mitigation: staggered TTLs; soft expiry with background refresh.
  • Analytics backlog — symptom: Kafka consumer lag grows. Mitigation: scale consumers; sample clicks 1/N for long-tail codes.
  • Code exhaustion — symptom: 6-char base62 gives ~56B codes, but operational bugs can cause reuse. Mitigation: monitor collision rate; lengthen codes automatically.
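The staggered-TTL mitigation can be as simple as jitter on the base cache TTL (values here are illustrative), so a cohort of popular keys cached at the same moment does not expire in the same instant:

```python
import random

BASE_TTL_S = 3600  # illustrative base cache TTL

def jittered_ttl(base: int = BASE_TTL_S, jitter_frac: float = 0.2) -> int:
    """Return base TTL +/- jitter_frac, spreading expiries over a window
    instead of letting them align and stampede the origin."""
    delta = int(base * jitter_frac)
    return base + random.randint(-delta, delta)
```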

8. Tradeoffs

  • DynamoDB for the mapping at mega scale (alternative: Postgres + Citus) — partition scalability and cross-region replication without heavy DBA work.
  • Redis in front of the DB (alternative: Memcached only) — rich TTL semantics, Lua for atomic touch, optional geo replication (caching).
  • Kafka for the click stream (alternative: SQS per region) — replay and a unified analytics pipeline; higher ops cost is acceptable.
  • 301 default for stable campaigns (alternative: 302 only) — SEO and CDN efficiency; product chooses per link.
  • Server-side counter IDs (alternative: random base62) — predictable length and collision-free allocation under load.
  • Edge-first redirects (alternative: every request to origin) — latency and cost; mandatory for this problem class.

9. Follow-ups (interviewer drill-downs)

  • What if read traffic grows 100x overnight on one code? Discuss CDN-only tier, anycast, and optional static export of top URLs to edge KV.
  • How do you migrate from Postgres to DynamoDB without downtime? Dual-write, shadow reads, cutover with feature flag.
  • Multi-region active-active for creates? Conflict resolution on idempotency keys; vector clocks rarely needed if codes globally unique via allocator.
  • Exactly-once analytics? At-least-once into Kafka + idempotent aggregates downstream; dedupe by (code, click_id).
  • Cost optimization? Aggressive CDN TTL for 301, cold storage for old analytics, right-size Redis by evicting cold codes from cache but not DB.
  • Abuse and phishing? Rate limit creates per account/IP using a distributed rate limiter design.
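The per-account/IP rate limiter in the last bullet can be sketched as a token bucket (single-process stand-in; a distributed version would keep the token count and timestamp in Redis and update them atomically, e.g. via a Lua script):

```python
import time
from typing import Optional

class TokenBucket:
    """Token bucket for one key: refills at `rate` tokens/s up to `burst`.
    `now` parameters allow deterministic testing; production would use the
    clock directly."""

    def __init__(self, rate: float, burst: int, now: Optional[float] = None):
        self.rate = rate                  # tokens refilled per second
        self.burst = burst                # bucket capacity
        self.tokens = float(burst)        # start full
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                      # caller responds 429
```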
