THN Interview Prep

Design Pastebin

1. Requirements

Functional

  • Users upload text (or small structured content) and receive a shareable URL to read the same content.
  • Optional syntax highlighting is a presentation concern; core system stores raw text or blob reference.
  • Optional password protection, burn-after-read, and max view count.
  • Time-to-live: pastes expire and become inaccessible (410/404 per product).
  • Listing or search for authenticated owners (optional feature); public anonymous pastes must be rate-limited.
  • Optional raw vs rendered download endpoint.

Non-Functional

  • Scale: 10M DAU, 1:20 create-to-read ratio typical (many pastes are write-once, read-few); some pastes go viral.
  • Create: 50/s sustained, peaks 500/s; read: 1,000/s sustained, peaks 20k/s for hot pastes.
  • Latency: first byte for small paste read p99 under 150 ms from regional POP; create p99 under 400 ms including blob write.
  • Availability: 99.9% for read; 99.5% for create during partial regional outage if multi-region is enabled.
  • Consistency: read-your-writes for author after create; public readers can be eventually consistent with short delay.
  • Durability: 11 nines for object storage when using S3 standard; metadata in RDBMS with backups.

Out of Scope

  • Real-time collaborative editing (CRDT/OT).
  • End-to-end encryption for pastes (design can mention optional client-side encryption as future).
  • Full-text search across all public pastes (only owner search or nothing).
  • Arbitrary binary executable hosting and malware scanning beyond basic MIME checks.

2. Back-of-Envelope Estimations

Assume 10M DAU, 0.01 pastes per user per day on average = 100k pastes/day. Average paste size 20 KB (code snippets dominate; tail includes logs up to max 1 MB product cap).

  • Create QPS: 100k / 86,400 ~ 1.2/s; peaks 100/s to 500/s with bots.
  • Read QPS: 20x creates ~ 2M reads/day ~ 23/s sustained; viral paste can spike to 10k/s for a single key (CDN critical).

Storage: 100k * 20 KB = 2 GB/day raw; ~730 GB/year. With 3x replication in the object store the billable footprint is higher; a 5-year archive is ~3.5 TB before lifecycle transitions to IA/Glacier.

Metadata row ~ 500 B * 100k/day = 50 MB/day in DB; trivial compared to blobs.

Egress: 2M reads * 20 KB = 40 GB/day typical; viral day 10x.

Cache: hot pastes (top 0.1% of 30M pastes in 1 year) ~ 30k * 1 MB cap = 30 GB in edge + 5 GB in origin Redis for metadata hot set.
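These numbers are easy to sanity-check with a few lines (a quick sketch; the inputs are just the assumptions stated above, with 20 KB taken as 20,000 bytes):

```python
# Reproduce the back-of-envelope numbers from this section.
DAU = 10_000_000
PASTES_PER_USER_PER_DAY = 0.01
AVG_PASTE_BYTES = 20_000          # 20 KB average paste
READ_WRITE_RATIO = 20             # 20 reads per create
METADATA_ROW_BYTES = 500
SECONDS_PER_DAY = 86_400

pastes_per_day = DAU * PASTES_PER_USER_PER_DAY            # 100k/day
create_qps = pastes_per_day / SECONDS_PER_DAY             # ~1.2/s sustained
reads_per_day = pastes_per_day * READ_WRITE_RATIO         # 2M/day
read_qps = reads_per_day / SECONDS_PER_DAY                # ~23/s sustained
raw_gb_per_day = pastes_per_day * AVG_PASTE_BYTES / 1e9   # 2 GB/day
raw_gb_per_year = raw_gb_per_day * 365                    # 730 GB/year
metadata_mb_per_day = pastes_per_day * METADATA_ROW_BYTES / 1e6  # 50 MB/day
egress_gb_per_day = reads_per_day * AVG_PASTE_BYTES / 1e9        # 40 GB/day

print(f"create {create_qps:.1f}/s, read {read_qps:.0f}/s, "
      f"storage {raw_gb_per_day:.0f} GB/day ({raw_gb_per_year:.0f} GB/yr), "
      f"egress {egress_gb_per_day:.0f} GB/day")
```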


3. API Design

POST /v1/pastes
Content-Type: application/json
Body: { "content": "...", "syntax": "python", "ttlSeconds": 604800, "visibility": "unlisted" }
-> 201 { "pasteId": "7f3a9b2c", "url": "https://paste.example/7f3a9b2c" }
-> 413 { "error": "payload_too_large" }
-> 429 { "error": "rate_limited" }
GET /v1/pastes/{pasteId}
Headers: If-None-Match: "version"
-> 200 text/plain or application/json
-> 404 { "error": "not_found" }
-> 410 { "error": "expired" }
GET /v1/pastes/{pasteId}/raw
-> 200 (raw body, Content-Disposition inline)
DELETE /v1/pastes/{pasteId}
Authorization: Bearer <token>
-> 204

Internal gRPC: PutPaste, GetPasteMetadata, GetPasteObject for storage workers.
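The error codes on POST /v1/pastes can be pinned down with a small validation sketch. The 1 MB cap matches the product cap in the estimations; the TTL bounds, one-week default, and visibility set are assumptions for illustration:

```python
# Sketch of server-side validation for POST /v1/pastes.
MAX_CONTENT_BYTES = 1_000_000          # 1 MB product cap (see estimations)
MIN_TTL_S, MAX_TTL_S = 60, 365 * 86_400  # assumed bounds
ALLOWED_VISIBILITY = {"public", "unlisted", "private"}  # assumed set

def validate_create(body: dict) -> tuple[int, str]:
    """Return (http_status, error_code); (201, "") means accepted."""
    content = body.get("content")
    if not isinstance(content, str) or not content:
        return 400, "missing_content"
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        return 413, "payload_too_large"          # matches 413 above
    ttl = body.get("ttlSeconds", 7 * 86_400)     # default one week
    if not (MIN_TTL_S <= ttl <= MAX_TTL_S):
        return 400, "invalid_ttl"
    if body.get("visibility", "unlisted") not in ALLOWED_VISIBILITY:
        return 400, "invalid_visibility"
    return 201, ""
```

Rate limiting (the 429 path) sits in front of this check, in the gateway or limiter tier rather than the handler.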


4. Data Model

Entities

  • paste: paste_id (ULID), owner_id nullable, storage_key, size_bytes, content_type, syntax, created_at, expires_at, visibility, etag.
  • access_log (optional): for rate/abuse only, not full content logging.

Object storage vs inline DB

  • Blobs must live in S3-compatible object storage (AWS S3, GCS, MinIO) for cost and throughput; inline DB storage only for tiny payloads under 4 KB if the product wants a single round trip; usually not worth the operational split.

SQL

  • PostgreSQL for metadata: rich constraints, partial indexes on expires_at for TTL sweeper, JSONB for optional settings. Use read replicas for GET-heavy workload.
  • NoSQL alternative: DynamoDB with paste_id as PK; GSI on expires_at as sparse index for deletion scanner—good when already on AWS and want TTL feature.

Indexes

  • PK: paste_id.
  • Index: expires_at for batch expiry job.
  • Optional: owner_id, created_at for listing.

Sample metadata row

| paste_id | storage_key | expires_at | size_bytes |
|---|---|---|---|
| 01JABC… | s3://bucket/ab/c1/7f3a9b2c | 2026-05-07T00:00:00Z | 18432 |

5. High-Level Architecture

Client → CDN/edge cache → stateless API tier → Redis cache and PostgreSQL metadata, with paste bodies in S3; a rate limiter fronts the API, and async workers handle expiry sweeps and delete propagation.

Supporting pieces: edge and Redis caching, plus a message queue for async delete propagation if needed.


6. Component Deep-Dives

Upload path

  1. Rate limit per IP/user using a distributed rate limiter (token bucket in Redis/Envoy).
  2. Size validate at API and optionally pre-signed direct-to-S3 upload for large payloads to keep API servers stateless.
  3. Generate paste_id with ULID (time-sortable, URL-safe) rather than UUIDv4; ULIDs keep related keys range-scannable and easier to correlate in logs.
  4. Write object to S3 with StorageClass standard; SSE-S3 or KMS for compliance.
  5. Transactionally insert metadata; on failure, run an async compensating delete of the orphan object (S3 lifecycle rules can also reap unreferenced keys via tag-based GC).

Read path

  1. CDN: cache GET /v1/pastes/{id} with a key that includes visibility and whether a password is set; do not cache password-protected content at a shared edge without Vary on auth, and in practice those requests often bypass the CDN entirely.
  2. Redis: cache full bodies for small pastes, or only metadata plus the storage pointer; for large bodies, redirect to a short-TTL signed S3 URL so the origin never streams the bytes.
  3. S3 range reads for huge pastes if product allows partial view.
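The origin side of this read path can be sketched with a dict standing in for Redis and callables for the DB and URL signer. The 64 KB inline threshold is an assumption, and rows carry the body directly for brevity:

```python
INLINE_MAX_BYTES = 64 * 1024   # assumed cutoff for serving bytes inline

def serve_paste(paste_id, cache, metadata_db, sign_url):
    """Cache-aside read: small pastes inline, large ones via signed URL."""
    meta = cache.get(paste_id)
    if meta is None:
        meta = metadata_db.get(paste_id)
        if meta is None:
            return 404, None
        cache[paste_id] = meta                  # populate on miss
    if meta["size_bytes"] <= INLINE_MAX_BYTES:
        return 200, meta["body"]                # small paste: serve inline
    return 302, sign_url(meta["storage_key"])   # large: redirect to S3
```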

Expiration

  • Sweeper job: query expires_at < now in pages; delete S3 object, delete row. DynamoDB TTL is alternative for stateless expiry events.
  • CDN must invalidate or rely on short TTL; prefer short max-age with strong ETag.
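The sweeper can be sketched as a paged loop. Page size and the deletion order are assumptions; deleting the blob before the row means a crash leaves at worst a dangling metadata row that the next run retries:

```python
def sweep_expired(db_rows, now, delete_object, delete_row, page_size=1000):
    """db_rows: iterable of (paste_id, storage_key, expires_at) tuples.

    Pages through expired rows, deleting the S3 object first and the
    metadata row second. Returns the number of pastes removed.
    """
    deleted = 0
    rows = list(db_rows)
    while True:
        page = [r for r in rows if r[2] < now][:page_size]
        if not page:
            return deleted
        for paste_id, storage_key, _ in page:
            delete_object(storage_key)   # blob first
            delete_row(paste_id)         # then the row
            deleted += 1
        done = {p[0] for p in page}
        rows = [r for r in rows if r[0] not in done]
```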

Why S3 over NFS or block storage

  • Erasure coding, multi-AZ durability, and per-request pricing match spiky read patterns; EFS/NFS adds operational pain and makes global scaling hard.

Why Postgres over only Dynamo

  • Postgres wins when complex admin queries and schema migrations matter; if fully serverless on AWS, DynamoDB plus Step Functions for the sweeper wins on ops.

7. Bottlenecks & Mitigations

| Issue | What breaks | Mitigation |
|---|---|---|
| Viral read on one paste | Origin S3 and small Redis hot key | Edge cache with ETag; replicate object to many regional buckets; consistent hashing for cache shards |
| Write amplification on update | Re-editing large files | Immutable pastes only in MVP; versioned S3 keys on edit |
| Expiry job lag | Stale 200 from CDN | Low s-maxage for pastes with short TTL; active purge via CDN API on delete |
| Cost at rest | Many tiny objects | Pack small pastes <4 KB in DB or merge into daily batch files for cold tier (complexity trade) |
| Abuse | Spam pastes | CAPTCHA, account walls, rate limiter integration |
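The "consistent hashing for cache shards" mitigation can be sketched as a small hash ring: each node contributes many virtual points, a key maps to the first point clockwise, and adding a node only steals keys for the new node. Node names and the vnode count are assumptions:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes for cache sharding."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []                       # sorted (point, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._h(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """First point clockwise from the key's hash, wrapping at the end."""
        idx = bisect.bisect(self._ring, (self._h(key), "")) % len(self._ring)
        return self._ring[idx][1]
```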

8. Tradeoffs

| Decision | Alternative | Why we picked |
|---|---|---|
| S3 for body | DB BLOB | Cost, throughput, and lifecycle policies |
| ULID paste_id | Auto-increment | No coordination across regions; sortable |
| Pre-signed S3 upload | API proxy all bytes | Offload bandwidth and CPU from API tier |
| PostgreSQL metadata | MongoDB | ACID for metadata + object consistency story |
| Public CDN cache | Always origin | Latency and cost; privacy modes bypass |
| Kafka for abuse pipeline | Synchronous scan | Decouples strict path; optional feature |

9. Follow-ups (interviewer drill-downs)

  • How to support search for my pastes? GIN index on to_tsvector in Postgres for title/tags; not global search.
  • Multi-region active-active? Write to local region S3; cross-region replication for hot data; global table for metadata with conflict resolution on rare same-ID (should not happen with ULID from one allocator per region + prefix).
  • Encryption at rest only vs client-side? Document threat model; client-side keys mean we cannot search content.
  • 1000x read spike? Rely on CDN; if dynamic, use distributed cache layer and request coalescing.
  • GDPR delete? Hard delete S3 + metadata + purge CDN; audit log the request id.
  • LLD for chunking large files? Presigned multipart upload, complete callback to register storage_key in paste row.
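The request-coalescing answer to the read-spike question can be sketched as a single-flight wrapper: concurrent misses for the same key wait on one origin fetch. A simplified sketch; a production version also needs error propagation to followers and timeouts:

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches of the same key into one origin call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event carrying .result when done

    def do(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                ev.result = fetch()   # single origin call for this key
            finally:
                ev.set()
                with self._lock:
                    self._inflight.pop(key, None)
        else:
            ev.wait()                 # followers ride along
        return ev.result
```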
