THN Interview Prep

Design Pastebin

1. Requirements

Functional

  • Users upload text (or small structured content) and receive a shareable URL to read the same content.
  • Optional syntax highlighting is a presentation concern; core system stores raw text or blob reference.
  • Optional password protection, burn-after-read, and max view count.
  • Time-to-live: pastes expire and become inaccessible (410/404 per product).
  • Listing or search for authenticated owners (optional feature); public anonymous pastes must be rate-limited.
  • Optional raw vs rendered download endpoint.

Non-Functional

  • Scale: 10M DAU, 1:20 create-to-read ratio typical (many pastes are write-once, read-few); some pastes go viral.
  • Create: 50/s sustained, peaks 500/s; read: 1,000/s sustained, peaks 20k/s for hot pastes.
  • Latency: first byte for small paste read p99 under 150 ms from regional POP; create p99 under 400 ms including blob write.
  • Availability: 99.9% for read; 99.5% for create during partial regional outage if multi-region is enabled.
  • Consistency: read-your-writes for author after create; public readers can be eventually consistent with short delay.
  • Durability: 11 nines for object storage when using S3 standard; metadata in RDBMS with backups.

Out of Scope

  • Real-time collaborative editing (CRDT/OT).
  • End-to-end encryption for pastes (design can mention optional client-side encryption as future).
  • Full-text search across all public pastes (only owner search or nothing).
  • Arbitrary binary executable hosting and malware scanning beyond basic MIME checks.

2. Back-of-Envelope Estimations

Assume 10M DAU, 0.01 pastes per user per day on average = 100k pastes/day. Average paste size 20 KB (code snippets dominate; tail includes logs up to max 1 MB product cap).

  • Create QPS: 100k / 86,400 ~ 1.2/s; peaks 100/s to 500/s with bots.
  • Read QPS: 20x creates ~ 2M reads/day ~ 23/s sustained; viral paste can spike to 10k/s for a single key (CDN critical).

Storage: 100k * 20 KB = 2 GB/day raw; ~730 GB/year. With 3x replication in the object store the billable footprint is higher; a 5-year archive is ~3.5 TB before lifecycle transitions to IA/Glacier.

Metadata row ~ 500 B * 100k/day = 50 MB/day in DB; trivial compared to blobs.

Egress: 2M reads * 20 KB = 40 GB/day typical; viral day 10x.

Cache: hot pastes (top 0.1% of 30M pastes in 1 year) ~ 30k * 1 MB cap = 30 GB in edge + 5 GB in origin Redis for metadata hot set.
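These numbers are easy to sanity-check with a few lines (a quick sketch; the inputs are just the assumptions stated above, with 20 KB taken as 20,000 bytes):

```python
# Reproduce the back-of-envelope numbers from this section.
DAU = 10_000_000
PASTES_PER_USER_PER_DAY = 0.01
AVG_PASTE_BYTES = 20_000          # 20 KB average paste
READ_WRITE_RATIO = 20             # 20 reads per create
METADATA_ROW_BYTES = 500
SECONDS_PER_DAY = 86_400

pastes_per_day = DAU * PASTES_PER_USER_PER_DAY            # 100k/day
create_qps = pastes_per_day / SECONDS_PER_DAY             # ~1.2/s sustained
reads_per_day = pastes_per_day * READ_WRITE_RATIO         # 2M/day
read_qps = reads_per_day / SECONDS_PER_DAY                # ~23/s sustained
raw_gb_per_day = pastes_per_day * AVG_PASTE_BYTES / 1e9   # 2 GB/day
raw_gb_per_year = raw_gb_per_day * 365                    # 730 GB/year
metadata_mb_per_day = pastes_per_day * METADATA_ROW_BYTES / 1e6  # 50 MB/day
egress_gb_per_day = reads_per_day * AVG_PASTE_BYTES / 1e9        # 40 GB/day

print(f"create {create_qps:.1f}/s, read {read_qps:.0f}/s, "
      f"storage {raw_gb_per_day:.0f} GB/day ({raw_gb_per_year:.0f} GB/yr), "
      f"egress {egress_gb_per_day:.0f} GB/day")
```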


3. API Design

POST /v1/pastes
Content-Type: application/json
Body: { "content": "...", "syntax": "python", "ttlSeconds": 604800, "visibility": "unlisted" }
-> 201 { "pasteId": "7f3a9b2c", "url": "https://paste.example/7f3a9b2c" }
-> 413 { "error": "payload_too_large" }
-> 429 { "error": "rate_limited" }
GET /v1/pastes/{pasteId}
Headers: If-None-Match: "version"
-> 200 text/plain or application/json
-> 404 { "error": "not_found" }
-> 410 { "error": "expired" }
GET /v1/pastes/{pasteId}/raw
-> 200 (raw body, Content-Disposition inline)
DELETE /v1/pastes/{pasteId}
Authorization: Bearer <token>
-> 204

Internal gRPC: PutPaste, GetPasteMetadata, GetPasteObject for storage workers.
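The error codes on POST /v1/pastes can be pinned down with a small validation sketch. The 1 MB cap matches the product cap in the estimations; the TTL bounds, one-week default, and visibility set are assumptions for illustration:

```python
# Sketch of server-side validation for POST /v1/pastes.
MAX_CONTENT_BYTES = 1_000_000          # 1 MB product cap (see estimations)
MIN_TTL_S, MAX_TTL_S = 60, 365 * 86_400  # assumed bounds
ALLOWED_VISIBILITY = {"public", "unlisted", "private"}  # assumed set

def validate_create(body: dict) -> tuple[int, str]:
    """Return (http_status, error_code); (201, "") means accepted."""
    content = body.get("content")
    if not isinstance(content, str) or not content:
        return 400, "missing_content"
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        return 413, "payload_too_large"          # matches 413 above
    ttl = body.get("ttlSeconds", 7 * 86_400)     # default one week
    if not (MIN_TTL_S <= ttl <= MAX_TTL_S):
        return 400, "invalid_ttl"
    if body.get("visibility", "unlisted") not in ALLOWED_VISIBILITY:
        return 400, "invalid_visibility"
    return 201, ""
```

Rate limiting (the 429 path) sits in front of this check, in the gateway or limiter tier rather than the handler.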


4. Data Model

Entities

  • paste: paste_id (ULID), owner_id nullable, storage_key, size_bytes, content_type, syntax, created_at, expires_at, visibility, etag.
  • access_log (optional): for rate/abuse only, not full content logging.

Object storage vs inline DB

  • Blobs must live in S3-compatible object storage (AWS S3, GCS, MinIO) for cost and throughput; inline DB storage only for tiny payloads under 4 KB if the product wants a single round trip; usually not worth the operational split.

SQL

  • PostgreSQL for metadata: rich constraints, partial indexes on expires_at for TTL sweeper, JSONB for optional settings. Use read replicas for GET-heavy workload.
  • NoSQL alternative: DynamoDB with paste_id as PK; GSI on expires_at as sparse index for deletion scanner—good when already on AWS and want TTL feature.

Indexes

  • PK: paste_id.
  • Index: expires_at for batch expiry job.
  • Optional: owner_id, created_at for listing.

Sample metadata row

| paste_id | storage_key | expires_at | size_bytes |
|---|---|---|---|
| 01JABC… | s3://bucket/ab/c1/7f3a9b2c | 2026-05-07T00:00:00Z | 18432 |

5. High-Level Architecture

Client → CDN/edge cache → stateless API tier → Redis cache and PostgreSQL metadata, with paste bodies in S3; a rate limiter fronts the API, and async workers handle expiry sweeps and delete propagation.

Supporting pieces: edge and Redis caching, plus a message queue for async delete propagation if needed.


6. Component Deep-Dives

Upload path

  1. Rate limit per IP/user using a distributed rate limiter (token bucket in Redis/Envoy).
  2. Size validate at API and optionally pre-signed direct-to-S3 upload for large payloads to keep API servers stateless.
  3. Generate paste_id with ULID (time-sortable, URL-safe) rather than UUIDv4; ULIDs keep related keys range-scannable and easier to correlate in logs.
  4. Write object to S3 with StorageClass standard; SSE-S3 or KMS for compliance.
  5. Transactionally insert metadata; on failure, run an async compensating delete of the orphan object (S3 lifecycle rules can also reap unreferenced keys via tag-based GC).

Read path

  1. CDN: cache GET /v1/pastes/{id} with a key that includes visibility and whether a password is set; do not cache password-protected content at a shared edge without Vary on auth, and in practice those requests often bypass the CDN entirely.
  2. Redis: cache full bodies for small pastes, or only metadata plus the storage pointer; for large bodies, redirect to a short-TTL signed S3 URL so the origin never streams the bytes.
  3. S3 range reads for huge pastes if product allows partial view.
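The origin side of this read path can be sketched with a dict standing in for Redis and callables for the DB and URL signer. The 64 KB inline threshold is an assumption, and rows carry the body directly for brevity:

```python
INLINE_MAX_BYTES = 64 * 1024   # assumed cutoff for serving bytes inline

def serve_paste(paste_id, cache, metadata_db, sign_url):
    """Cache-aside read: small pastes inline, large ones via signed URL."""
    meta = cache.get(paste_id)
    if meta is None:
        meta = metadata_db.get(paste_id)
        if meta is None:
            return 404, None
        cache[paste_id] = meta                  # populate on miss
    if meta["size_bytes"] <= INLINE_MAX_BYTES:
        return 200, meta["body"]                # small paste: serve inline
    return 302, sign_url(meta["storage_key"])   # large: redirect to S3
```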

Expiration

  • Sweeper job: query expires_at < now in pages; delete S3 object, delete row. DynamoDB TTL is alternative for stateless expiry events.
  • CDN must invalidate or rely on short TTL; prefer short max-age with strong ETag.
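The sweeper can be sketched as a paged loop. Page size and the deletion order are assumptions; deleting the blob before the row means a crash leaves at worst a dangling metadata row that the next run retries:

```python
def sweep_expired(db_rows, now, delete_object, delete_row, page_size=1000):
    """db_rows: iterable of (paste_id, storage_key, expires_at) tuples.

    Pages through expired rows, deleting the S3 object first and the
    metadata row second. Returns the number of pastes removed.
    """
    deleted = 0
    rows = list(db_rows)
    while True:
        page = [r for r in rows if r[2] < now][:page_size]
        if not page:
            return deleted
        for paste_id, storage_key, _ in page:
            delete_object(storage_key)   # blob first
            delete_row(paste_id)         # then the row
            deleted += 1
        done = {p[0] for p in page}
        rows = [r for r in rows if r[0] not in done]
```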

Why S3 over NFS or block storage

  • Erasure coding, multi-AZ durability, and per-request pricing match spiky read patterns; EFS/NFS adds operational pain and makes global scaling hard.

Why Postgres over only Dynamo

  • Postgres wins when complex admin queries and schema migrations matter; if fully serverless on AWS, DynamoDB plus Step Functions for the sweeper wins on ops.

7. Bottlenecks & Mitigations

| Issue | What breaks | Mitigation |
|---|---|---|
| Viral read on one paste | Origin S3 and small Redis hot key | Edge cache with ETag; replicate object to many regional buckets; consistent hashing for cache shards |
| Write amplification on update | Re-editing large files | Immutable pastes only in MVP; versioned S3 keys on edit |
| Expiry job lag | Stale 200 from CDN | Low s-maxage for pastes with short TTL; active purge via CDN API on delete |
| Cost at rest | Many tiny objects | Pack small pastes <4 KB in DB or merge into daily batch files for cold tier (complexity trade) |
| Abuse | Spam pastes | CAPTCHA, account walls, rate limiter integration |
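The "consistent hashing for cache shards" mitigation can be sketched as a small hash ring: each node contributes many virtual points, a key maps to the first point clockwise, and adding a node only steals keys for the new node. Node names and the vnode count are assumptions:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes for cache sharding."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []                       # sorted (point, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._h(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """First point clockwise from the key's hash, wrapping at the end."""
        idx = bisect.bisect(self._ring, (self._h(key), "")) % len(self._ring)
        return self._ring[idx][1]
```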

8. Tradeoffs

| Decision | Alternative | Why we picked |
|---|---|---|
| S3 for body | DB BLOB | Cost, throughput, and lifecycle policies |
| ULID paste_id | Auto-increment | No coordination across regions; sortable |
| Pre-signed S3 upload | API proxy all bytes | Offload bandwidth and CPU from API tier |
| PostgreSQL metadata | MongoDB | ACID for metadata + object consistency story |
| Public CDN cache | Always origin | Latency and cost; privacy modes bypass |
| Kafka for abuse pipeline | Synchronous scan | Decouples strict path; optional feature |

9. Follow-ups (interviewer drill-downs)

  • How to support search for my pastes? GIN index on to_tsvector in Postgres for title/tags; not global search.
  • Multi-region active-active? Write to local region S3; cross-region replication for hot data; global table for metadata with conflict resolution on rare same-ID (should not happen with ULID from one allocator per region + prefix).
  • Encryption at rest only vs client-side? Document threat model; client-side keys mean we cannot search content.
  • 1000x read spike? Rely on CDN; if dynamic, use distributed cache layer and request coalescing.
  • GDPR delete? Hard delete S3 + metadata + purge CDN; audit log the request id.
  • LLD for chunking large files? Presigned multipart upload, complete callback to register storage_key in paste row.
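The request-coalescing answer to the read-spike question can be sketched as a single-flight wrapper: concurrent misses for the same key wait on one origin fetch. A simplified sketch; a production version also needs error propagation to followers and timeouts:

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches of the same key into one origin call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event carrying .result when done

    def do(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                ev.result = fetch()   # single origin call for this key
            finally:
                ev.set()
                with self._lock:
                    self._inflight.pop(key, None)
        else:
            ev.wait()                 # followers ride along
        return ev.result
```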
