THN Interview Prep

Design Instagram (Photo Sharing Feed)

1. Requirements

Functional

  • Users upload images/videos; server generates multiple renditions (thumbnail, standard, HD) and serves via CDN.
  • Follow graph drives the personalized feed; Stories (ephemeral 24 h content) is a parallel product surface sharing the same infrastructure.
  • Engagement: like, comment, save, share to stories.
  • Direct messaging as separate real-time path (touch lightly or reference WhatsApp patterns).
  • Explore tab recommendations beyond social graph (ML ranking).

Non-Functional

  • Scale: 1B+ MAU class; tens of millions of media uploads/day; billions of feed impressions/day.
  • Latency: fast upload acknowledgement with processing async; feed first paint p99 under 400 ms with a warm cache.
  • Availability: 99.99% for reads; uploads tolerate retries via a resumable protocol.
  • Consistency: seconds-level eventual fan-out to followers is acceptable; like counts are eventually consistent, with the UI tolerating a small +/- delta.
  • Durability: no media loss; object storage at 11 nines, metadata transactional.

Out of Scope

  • Full Reels recommendation ML stack depth.
  • Payment and shopping checkout flows.
  • Content moderation model training (CV); assume hooks into human review tooling only.
  • IGTV long-form separate product complexity.

2. Back-of-Envelope Estimations

Uploads: 50M media items/day (a blended photo+video teaching number) ≈ 600/s average; peaks run 2x–5x the daily average during events → roughly 1,200–3,000/s, with bursts higher.

Feed reads: 500M DAU * ~80 impressions/day ≈ 40B impressions/day, but impressions are a display metric, not API load. Better: ~10 feed API calls per session * 500M DAU across multiple sessions/day → 100k–500k RPS at origin globally after CDN. Still enormous, and edge caching is largely ineffective for a personalized feed except for shared pieces (media, author objects).

Storage: an average processed photo is ~3 MB and video is much larger; assume an effective 15 MB stored per item including transcoded ladders → 50M * 15 MB ≈ 750 TB/day of new media. Dedupe and compression vary in reality; for the interview, the order of magnitude is hundreds of petabytes of media per year.

CDN egress: the dominant cost driver; streaming multi-megabyte video segments adds up to aggregate terabits per second during peaks, which requires tier-1 CDN contracts.

Metadata: 50M posts/day * ~1 KB ≈ 50 GB/day of rows; negligible next to media.
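
A quick sanity check of the arithmetic above, plugging in the same teaching numbers:

```python
# Back-of-envelope sanity check for the teaching numbers above.
SECONDS_PER_DAY = 86_400

uploads_per_day = 50_000_000
avg_upload_rps = uploads_per_day / SECONDS_PER_DAY          # ~579/s
peak_upload_rps = (avg_upload_rps * 2, avg_upload_rps * 5)  # ~1.2k-2.9k/s

dau = 500_000_000
feed_calls_per_user = 10
avg_feed_rps = dau * feed_calls_per_user / SECONDS_PER_DAY  # ~58k/s average

bytes_per_item = 15 * 10**6                                   # 15 MB incl. renditions
media_per_day_tb = uploads_per_day * bytes_per_item / 10**12  # ~750 TB/day
media_per_year_pb = media_per_day_tb * 365 / 1000             # ~274 PB/year

print(f"avg uploads/s: {avg_upload_rps:,.0f}")
print(f"avg feed RPS:  {avg_feed_rps:,.0f}")
print(f"media TB/day:  {media_per_day_tb:,.0f}")
print(f"media PB/year: {media_per_year_pb:,.0f}")
```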


3. API Design

```
POST /v1/media/upload-session
-> 201 { "uploadId": "up_123", "uploadUrls": [ { "rendition": "original", "url": "https://s3..." } ] }

POST /v1/posts
Body: { "mediaIds": ["m1"], "caption": "sunset", "locationId": null }
-> 201 { "postId": "p_456" }

GET /v1/feed/home?cursor=...
-> 200 { "items": [ { "postId": "...", "media": [...], "author": {...} } ] }

POST /v1/posts/{postId}/likes
-> 204

GET /v1/users/{id}/posts
-> 200 { "items": [...] }
```

Resumable uploads: the tus protocol or S3 multipart with presigned URLs provides resilience for large video uploads.
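
A minimal sketch of the upload-session endpoint using S3 multipart with presigned part URLs; the bucket name, key scheme, and part count are illustrative assumptions:

```python
import uuid

import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")
BUCKET = "media-originals"  # hypothetical bucket name

def create_upload_session(owner_id: str, num_parts: int) -> dict:
    """Start a multipart upload and hand the client presigned part URLs."""
    key = f"uploads/{owner_id}/{uuid.uuid4()}"
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=key)
    urls = [
        s3.generate_presigned_url(
            "upload_part",
            Params={
                "Bucket": BUCKET,
                "Key": key,
                "UploadId": mpu["UploadId"],
                "PartNumber": part,
            },
            ExpiresIn=3600,
        )
        for part in range(1, num_parts + 1)
    ]
    # Client PUTs each part, collects ETags, then calls complete_multipart_upload.
    return {"uploadId": mpu["UploadId"], "key": key, "partUrls": urls}
```

Retrying individual failed parts, rather than the whole object, is what makes large video uploads resilient.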


4. Data Model

Post

  • post_id, author_id, caption, created_at, media_ids[], location, visibility.

Media

  • media_id, owner_id, s3_keys per rendition, width, height, duration, codec.

Engagement

  • likes table or sharded counter column; Redis for hot counters with periodic flush to Cassandra/Scylla (sketch below).
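
A sketch of the hot-counter pattern with redis-py; the key names and the Cassandra writer are assumptions:

```python
import redis

r = redis.Redis()

def write_like_count_to_cassandra(post_id: str, count: int) -> None:
    ...  # stub: real version updates the durable counter row

def record_like(post_id: str, user_id: str) -> None:
    """Idempotent like: the set dedupes repeat taps before the counter moves."""
    if r.sadd(f"likers:{post_id}", user_id):  # returns 1 only on first add
        r.incr(f"likes:{post_id}")

def flush_counters(post_ids: list[str]) -> None:
    """Periodic job: push hot Redis counters down to the durable store."""
    for post_id in post_ids:
        count = int(r.get(f"likes:{post_id}") or 0)
        write_like_count_to_cassandra(post_id, count)
```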

Feed timeline

  • Same pattern as Twitter: Cassandra partition per viewer with ordered post_id.
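
For concreteness, one possible CQL layout via the Python driver; keyspace and column names are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("feeds")  # assumed keyspace

# One partition per viewer; newest posts first via clustering order.
session.execute("""
    CREATE TABLE IF NOT EXISTS home_timeline (
        viewer_id bigint,
        created_at timeuuid,
        post_id bigint,
        author_id bigint,
        PRIMARY KEY (viewer_id, created_at)
    ) WITH CLUSTERING ORDER BY (created_at DESC)
""")
```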

Why not single Postgres

  • Write fan-out and sheer row volume rule it out for timelines. Relational stays for accounts, billing hooks, and other OLTP; media metadata may live in sharded MySQL (as Instagram did historically) or DynamoDB. Acknowledge the evolution in the interview.

Indexes

  • (author_id, created_at) for profiles.
  • Geo indexes, optionally in Elasticsearch, for location discovery.

Sample media row

| media_id | owner_id | renditions | created_at |
| --- | --- | --- | --- |
| m789 | u22 | { thumb: "s3://...", std: "s3://..." } | 2026-04-29 |

5. High-Level Architecture

Clients talk to the API tier for metadata and feeds; uploads go directly to S3 via presigned URLs and fan into async transcode workers; feed reads hit Cassandra timelines fronted by Memcached/Redis; media is served through the CDN.

6. Component Deep-Dives

Upload & processing

  • Client uploads to S3 via a presigned URL, removing API servers from the data path; versus proxied upload this saves API-tier bandwidth and scales horizontally without bottlenecking the load balancer.
  • SQS/Lambda or Kafka-driven FFmpeg workers transcode video into HLS bitrate ladders (worker sketch after this list); photo thumbnails via libvips.
  • Magic-byte validation plus ClamAV virus scan in the async pipeline; don't block first byte if the product accepts removing the rare bad item after the fact.
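
A minimal worker loop under stated assumptions: an SQS queue of transcode jobs, ffmpeg on PATH, and a single 720p HLS rung standing in for the full ladder:

```python
import json
import subprocess

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.../transcode-jobs"  # hypothetical queue URL

def transcode_to_hls(src: str, out_dir: str) -> None:
    """Shell out to ffmpeg for one 720p HLS rendition (one rung of the ladder)."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-vf", "scale=-2:720",
         "-c:v", "libx264", "-c:a", "aac",
         "-hls_time", "6", "-hls_playlist_type", "vod",
         f"{out_dir}/720p.m3u8"],
        check=True,
    )

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        transcode_to_hls(job["src"], job["out_dir"])  # S3 download/upload omitted
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```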

Feed generation

  • Same fan-out-on-write vs. pull hybrid as Twitter's news feed (sketch below); Stories use shorter-TTL partitions.
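
A sketch of the hybrid decision at publish time; the threshold and in-memory structures are stand-ins for the real graph and timeline stores:

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000      # assumed cutoff; tune empirically
timelines = defaultdict(list)     # follower_id -> [post_id], in-memory stand-in
pull_sources = defaultdict(list)  # author_id -> [post_id] merged at read time
follower_graph = defaultdict(set) # author_id -> follower ids

def publish_post(author_id: int, post_id: int) -> None:
    """Hybrid fan-out: push to small audiences, defer celebrities to read time."""
    followers = follower_graph[author_id]
    if len(followers) > CELEBRITY_THRESHOLD:
        pull_sources[author_id].append(post_id)  # pull path
    else:
        for follower_id in followers:            # push path
            timelines[follower_id].append(post_id)
```

Readers of celebrity-heavy feeds merge pull_sources at query time, which bounds the write amplification of a viral post.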

Ranking

  • Explore uses ML features from a feature store (conceptually: Redis for online features, Hive for offline) with TensorFlow Serving for inference; bespoke vs. cloud (Vertex AI) trades on org maturity.

CDN

  • CloudFront / Akamai: signed URLs for private accounts vs. long-cache for public influencer content; if the cache key must vary on user auth, design it carefully.
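
A sketch of CloudFront signed URLs for private media, using botocore's CloudFrontSigner; the key pair id and domain are hypothetical:

```python
from datetime import datetime, timedelta

import rsa  # pip install rsa; assumes an RSA key pair registered with CloudFront
from botocore.signers import CloudFrontSigner

KEY_PAIR_ID = "K2JCJMDEHXQW5F"  # hypothetical CloudFront key pair id

with open("private_key.pem", "rb") as f:
    private_key = rsa.PrivateKey.load_pkcs1(f.read())

def rsa_signer(message: bytes) -> bytes:
    return rsa.sign(message, private_key, "SHA-1")  # CloudFront requires SHA-1

signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)

def private_media_url(path: str, ttl_minutes: int = 10) -> str:
    """Short-lived URL for a private account's media object."""
    return signer.generate_presigned_url(
        f"https://media.example-cdn.com/{path}",  # hypothetical distribution
        date_less_than=datetime.utcnow() + timedelta(minutes=ttl_minutes),
    )
```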

Caching

  • Memcached historically dominant for hot object metadata at Meta scale; Redis cluster for sessions and cached graph fragments.

Stories

  • Separate Cassandra table, or the same table with row-level TTL; Redis expiring keys alone are insufficient for durability (sketch below).
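
Row-level TTL in Cassandra via the Python driver; keyspace, table, and values are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stories")  # assumed keyspace

# Row-level TTL: the story disappears from reads 24h after insert.
session.execute(
    """
    INSERT INTO stories (author_id, story_id, media_key)
    VALUES (%s, %s, %s) USING TTL 86400
    """,
    (42, 1001, "s3://stories/..."),
)
```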

7. Bottlenecks & Mitigations

| Bottleneck | Scenario | Mitigation |
| --- | --- | --- |
| Transcode backlog | Viral video spikes | Autoscale workers on queue depth; shed load by delaying non-critical renditions |
| Feed cold start | New user | Onboarding suggestions from the popular graph |
| Hot influencer post | Fan-out storm | Hybrid pull + dynamic insertion |
| CDN origin overload | Cache-miss storm | Origin shield; internal tier caching |
| Counter inconsistency | Like mashing | Idempotent like API; CRDT-style counters are likely overkill, settle for a tolerance |

8. Tradeoffs

| Decision | Alternative | Why we picked it |
| --- | --- | --- |
| S3 + async processing | Disk on API boxes | Elastic capacity and durability |
| Kafka event backbone | RabbitMQ | Durability at millions of msgs/min |
| Cassandra feed | DynamoDB | Similar; pick based on org cloud |
| Presigned direct upload | Proxied multipart | Preserves API-tier CPU and NICs |
| HLS streaming | Single MP4 | Adaptive bitrate on mobile networks |
| Separate Explore ranking | Chronological only | Engagement product requirement |

9. Follow-ups (interviewer drill-downs)

  • Live video? Separate low-latency path: RTMP ingest → packaging → CDN; WebRTC vs. HLS is a latency trade-off.
  • Copyright detection? Async perceptual-hashing (pHash) pipeline plus legal tooling (sketch after this list).
  • DM E2E? Signal Protocol style; massive scope, reference the WhatsApp doc.
  • Deletion: remove from S3 (lifecycle policy helps), purge the CDN; be aware of wildcard invalidation costs.
  • Cost optimization: serve lower bitrates in bandwidth-constrained markets, driven by device-capability hint headers.
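
For the pHash pipeline above, a minimal sketch with the imagehash library; the Hamming threshold is an assumption to tune against labeled matches:

```python
from PIL import Image
import imagehash  # pip install imagehash pillow

HAMMING_THRESHOLD = 8  # assumed cutoff; tune on labeled match data

def is_probable_copy(upload_path: str, reference_path: str) -> bool:
    """Compare perceptual hashes; small Hamming distance suggests the same image."""
    upload_hash = imagehash.phash(Image.open(upload_path))
    reference_hash = imagehash.phash(Image.open(reference_path))
    return (upload_hash - reference_hash) <= HAMMING_THRESHOLD
```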
