Design Pastebin
1. Requirements
Functional
- Users upload text (or small structured content) and receive a shareable URL to read the same content.
- Optional syntax highlighting is a presentation concern; core system stores raw text or blob reference.
- Optional password protection, burn-after-read, and max view count.
- Time-to-live: pastes expire and become inaccessible (410/404 per product).
- List or search for authenticated owners (optional product); public anonymous paste must be rate-limited.
- Optional raw vs rendered download endpoint.
Non-Functional
- Scale: 10M DAU, 1:20 create-to-read ratio typical (many pastes are write-once, read-few); some pastes go viral.
- Throughput targets (provisioned with headroom over the daily averages): create 50/s sustained, 500/s peak; read 1,000/s sustained, 20k/s peak for hot pastes.
- Latency: first byte for small paste read p99 under 150 ms from regional POP; create p99 under 400 ms including blob write.
- Availability: 99.9% for read; 99.5% for create during partial regional outage if multi-region is enabled.
- Consistency: read-your-writes for author after create; public readers can be eventually consistent with short delay.
- Durability: 11 nines for object storage when using S3 standard; metadata in RDBMS with backups.
Out of Scope
- Real-time collaborative editing (CRDT/OT).
- End-to-end encryption for pastes (design can mention optional client-side encryption as future).
- Full-text search across all public pastes (only owner search or nothing).
- Arbitrary binary executable hosting and malware scanning beyond basic MIME checks.
2. Back-of-Envelope Estimations
Assume 10M DAU, 0.01 pastes per user per day on average = 100k pastes/day. Average paste size 20 KB (code snippets dominate; tail includes logs up to max 1 MB product cap).
- Create QPS: 100k / 86,400 ~ 1.2/s; peaks 100/s to 500/s with bots.
- Read QPS: 20x creates ~ 2M reads/day ~ 23/s sustained; viral paste can spike to 10k/s for a single key (CDN critical).
Storage: 100k * 20 KB = 2 GB/day raw; 730 GB/year. With 3x replication in object store, billable is higher; include 5-year archive ~ 3.5 TB before lifecycle to IA/Glacier.
Metadata row ~ 500 B * 100k/day = 50 MB/day in DB; trivial compared to blobs.
Egress: 2M reads * 20 KB = 40 GB/day typical; viral day 10x.
Cache: hot pastes (top 0.1% of 30M pastes in 1 year) ~ 30k * 1 MB cap = 30 GB in edge + 5 GB in origin Redis for metadata hot set.
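These figures can be sanity-checked with a few lines of arithmetic; the inputs are the assumptions stated above (100k pastes/day, 20 KB average, 1:20 create-to-read), not measurements:

```python
DAY = 86_400

pastes_per_day = 100_000          # 10M DAU * 0.01 pastes/user/day
avg_paste_bytes = 20_000          # 20 KB average paste
read_multiple = 20                # 1:20 create-to-read ratio

create_qps = pastes_per_day / DAY                            # ~1.2/s sustained
reads_per_day = pastes_per_day * read_multiple               # 2M reads/day
read_qps = reads_per_day / DAY                               # ~23/s sustained
storage_per_day_gb = pastes_per_day * avg_paste_bytes / 1e9  # ~2 GB/day raw
egress_per_day_gb = reads_per_day * avg_paste_bytes / 1e9    # ~40 GB/day

print(f"create {create_qps:.1f}/s, read {read_qps:.0f}/s, "
      f"storage {storage_per_day_gb:.0f} GB/day, egress {egress_per_day_gb:.0f} GB/day")
```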
3. API Design
POST /v1/pastes
Content-Type: application/json
Body: { "content": "...", "syntax": "python", "ttlSeconds": 604800, "visibility": "unlisted" }
-> 201 { "pasteId": "7f3a9b2c", "url": "https://paste.example/7f3a9b2c" }
-> 413 { "error": "payload_too_large" }
-> 429 { "error": "rate_limited" }
GET /v1/pastes/{pasteId}
Headers: If-None-Match: "version"
-> 200 text/plain or application/json
-> 404 { "error": "not_found" }
-> 410 { "error": "expired" }
GET /v1/pastes/{pasteId}/raw
-> 200 (raw body, Content-Disposition: inline)
DELETE /v1/pastes/{pasteId}
Authorization: Bearer <token>
-> 204
Internal gRPC: PutPaste, GetPasteMetadata, GetPasteObject for storage workers.
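A minimal sketch of the create and read handlers, framework-free, with an in-memory dict standing in for the real storage path. It only illustrates the status codes above; auth, rate limiting, and TTL enforcement are omitted, and `secrets.token_hex` stands in for the real ID scheme:

```python
import secrets

MAX_BYTES = 1_000_000          # 1 MB product cap from the requirements
DEFAULT_TTL = 604_800          # 7 days, as in the example body

def handle_create(body: dict, store: dict):
    """POST /v1/pastes: validate and store; returns (status, json)."""
    content = body.get("content", "")
    if len(content.encode()) > MAX_BYTES:
        return 413, {"error": "payload_too_large"}
    paste_id = secrets.token_hex(4)   # 8 hex chars like "7f3a9b2c"
    store[paste_id] = {
        "content": content,
        "syntax": body.get("syntax"),
        "ttlSeconds": body.get("ttlSeconds", DEFAULT_TTL),
        "visibility": body.get("visibility", "unlisted"),
    }
    return 201, {"pasteId": paste_id, "url": f"https://paste.example/{paste_id}"}

def handle_get(paste_id: str, store: dict):
    """GET /v1/pastes/{pasteId}: returns (status, body)."""
    if paste_id not in store:
        return 404, {"error": "not_found"}
    return 200, store[paste_id]["content"]
```

The 410 path would require an `expires_at` check against the stored TTL, which the expiration section covers.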
4. Data Model
Entities
paste: paste_id (ULID), owner_id (nullable), storage_key, size_bytes, content_type, syntax, created_at, expires_at, visibility, etag.
access_log (optional): for rate/abuse accounting only, not full content logging.
Object storage vs inline DB
- Blobs must live in S3-compatible object storage (AWS S3, GCS, MinIO) for cost and throughput; inline DB only for tiny payloads under 4 KB if product wants single round trip—usually not worth operational split.
SQL
- PostgreSQL for metadata: rich constraints, a partial index on expires_at for the TTL sweeper, JSONB for optional settings. Use read replicas for the GET-heavy workload.
- NoSQL alternative: DynamoDB with paste_id as PK; a sparse GSI on expires_at for the deletion scanner—good when already on AWS and the native TTL feature is wanted.
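If the DynamoDB route is taken, the item can carry `expires_at` as the native-TTL attribute (epoch seconds). A sketch, assuming a table named `pastes` with TTL enabled on `expires_at`; the boto3 call itself is shown only as a comment:

```python
import time

def build_paste_item(paste_id, storage_key, ttl_seconds, size_bytes, now=None):
    """Build a DynamoDB item for the paste table; expires_at doubles as the
    native-TTL attribute when TTL is enabled on it."""
    now = int(time.time()) if now is None else now
    return {
        "paste_id": paste_id,              # partition key
        "storage_key": storage_key,
        "size_bytes": size_bytes,
        "created_at": now,
        "expires_at": now + ttl_seconds,   # DynamoDB TTL reaps after this epoch
    }

# Usage with boto3 (table name assumed):
#   boto3.resource("dynamodb").Table("pastes").put_item(Item=build_paste_item(...))
```

Note that DynamoDB TTL deletion is best-effort (typically within days of expiry), so reads must still check `expires_at` before serving.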
Indexes
- PK: paste_id.
- Index: expires_at for the batch expiry job.
- Optional: (owner_id, created_at) for listing.
Sample metadata row
| paste_id | storage_key | expires_at | size_bytes |
|---|---|---|---|
| 01JABC… | s3://bucket/ab/c1/7f3a9b2c | 2026-05-07T00:00:00Z | 18432 |
5. High-Level Architecture
Client -> CDN (edge cache) -> load balancer -> stateless API tier -> Redis (hot metadata and small bodies), PostgreSQL (metadata), S3 (blobs). A message queue carries async delete/expiry propagation if needed.
6. Component Deep-Dives
Upload path
- Rate limit per IP/user using a distributed rate limiter (token bucket in Redis/Envoy).
- Size validate at API and optionally pre-signed direct-to-S3 upload for large payloads to keep API servers stateless.
- Generate paste_id with ULID (time-sortable, URL-safe) rather than UUIDv4; ULID helps range scans and log correlation (ID generation).
- Write the object to S3 with StorageClass STANDARD; SSE-S3 or SSE-KMS for compliance.
- Insert the metadata row transactionally; on failure, run an async compensating delete of the orphan object (an S3 lifecycle rule with tag-based GC can also reap unreferenced keys).
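The blob-first write with a compensating delete can be sketched as below, with tiny in-memory stand-ins for S3 and the metadata table; `new_ulid` here is a simplified time-sortable id, not a spec-compliant ULID:

```python
import os
import time

def new_ulid() -> str:
    # Simplified time-sortable id: ms timestamp in hex + random suffix.
    return f"{int(time.time() * 1000):012x}{os.urandom(5).hex()}"

class BlobStore:                 # in-memory stand-in for S3
    def __init__(self): self.blobs = {}
    def put(self, key, data): self.blobs[key] = data
    def delete(self, key): self.blobs.pop(key, None)

class MetaDB:                    # in-memory stand-in for the paste table
    def __init__(self, fail=False): self.rows, self.fail = {}, fail
    def insert(self, **row):
        if self.fail:
            raise RuntimeError("db down")
        self.rows[row["paste_id"]] = row

def create_paste(content: bytes, db: MetaDB, blobs: BlobStore) -> str:
    paste_id = new_ulid()
    key = f"{paste_id[:2]}/{paste_id}"      # prefix-sharded storage key
    blobs.put(key, content)                 # 1) write blob first
    try:
        db.insert(paste_id=paste_id, storage_key=key, size_bytes=len(content))
    except Exception:
        blobs.delete(key)                   # 2) compensating delete of orphan
        raise
    return paste_id
```

Ordering matters: blob first means a crash between the two writes leaves an orphan object (reaped by GC), never a metadata row pointing at a missing blob.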
Read path
- CDN caches GET /v1/pastes/{id}: include visibility and password-hash presence in the cache key; do not cache password-protected content at a shared edge without Vary on auth—often simpler to bypass the CDN for those.
- Redis caches full small pastes, or only metadata plus the storage pointer; for large bodies, redirect to a signed S3 URL with a short TTL so the origin does not stream bytes.
- S3 range reads for huge pastes if product allows partial view.
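The small-body-from-cache versus signed-URL-redirect split might look like this; plain dicts stand in for Redis, the metadata table, and S3, the 64 KB threshold is an assumption, and the signed URL is a hypothetical placeholder:

```python
SMALL_LIMIT = 64 * 1024  # assumed cutoff for caching full bodies

def read_paste(paste_id: str, cache: dict, meta: dict, blobs: dict):
    """Read-path sketch: (kind, payload) where kind is body/redirect/404."""
    if (body := cache.get(paste_id)) is not None:
        return ("body", body)                       # cache hit
    row = meta.get(paste_id)
    if row is None:
        return ("404", None)
    if row["size_bytes"] <= SMALL_LIMIT:
        body = blobs[row["storage_key"]]
        cache[paste_id] = body                      # populate cache on miss
        return ("body", body)
    # Large body: hand back a short-lived signed URL instead of streaming.
    return ("redirect", f"https://s3.example/{row['storage_key']}?expires=60")
```

The redirect keeps API servers out of the data path for the heavy tail, at the cost of one extra client round trip.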
Expiration
- Sweeper job: query expires_at < now in pages; delete the S3 object, then the row. DynamoDB TTL is an alternative for stateless expiry events.
- The CDN must invalidate or rely on short TTLs; prefer a short max-age with a strong ETag.
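A paged sweeper sketch; an in-memory list stands in for the table, where production would page via `SELECT ... WHERE expires_at < now() ORDER BY expires_at LIMIT n` against the expires_at index:

```python
def sweep_expired(meta_rows: list, blobs: dict, now: float, page_size: int = 2) -> int:
    """Delete expired pastes in pages; blob first, then the metadata row,
    so an interrupted sweep can safely re-run."""
    deleted = 0
    while True:
        page = [r for r in meta_rows if r["expires_at"] < now][:page_size]
        if not page:
            return deleted
        for row in page:
            blobs.pop(row["storage_key"], None)   # delete object first
            meta_rows.remove(row)                 # then the row
            deleted += 1
```

Deleting the blob before the row mirrors the write path's invariant: a metadata row never outlives its object by design, only the reverse.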
Why S3 over NFS or block storage
- Erasure coding, multi-AZ durability, and per-request pricing match spiky read patterns; EFS/NFS adds operational pain and hard global scaling.
Why Postgres over only Dynamo
- Postgres when complex admin queries and schema migrations matter; if fully serverless on AWS, DynamoDB plus a Step Functions sweeper wins on ops.
7. Bottlenecks & Mitigations
| Issue | What breaks | Mitigation |
|---|---|---|
| Viral read on one paste | Origin S3 and small Redis hot key | Edge cache with ETag; replicate object to many regional buckets; consistent hashing for cache shards |
| Write amplification on update | Re-editing large files | Immutable pastes only in MVP; versioned S3 keys on edit |
| Expiry job lag | Stale 200 from CDN | s-maxage low for pastes with short TTL; active purge via CDN API on delete |
| Cost at rest | Many tiny objects | Pack small pastes <4 KB in DB or merge into daily batch files for cold tier (complexity trade) |
| Abuse | Spam pastes | CAPTCHA, account walls, rate limiter integration |
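The rate-limiter mitigation in the abuse row is a token bucket. A single-node sketch is below; the distributed version mentioned in the upload path would run the same refill arithmetic in Redis (commonly as a Lua script) keyed per IP or user:

```python
import time

class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `burst`; each request
    spends one token or is rejected."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The `burst` parameter absorbs legitimate short spikes while `rate` caps the sustained throughput, matching the create-path limits in the requirements.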
8. Tradeoffs
| Decision | Alternative | Why we picked |
|---|---|---|
| S3 for body | DB BLOB | Cost, throughput, and lifecycle policies |
| ULID paste_id | Auto-increment | No coordination across regions; sortable |
| Pre-signed S3 upload | API proxy all bytes | Offload bandwidth and CPU from API tier |
| PostgreSQL metadata | MongoDB | ACID for metadata + object consistency story |
| Public CDN cache | Always origin | Latency and cost; privacy modes bypass |
| Kafka for abuse pipeline | Synchronous scan | Decouples strict path; optional feature |
9. Follow-ups (interviewer drill-downs)
- How to support "search my pastes"? GIN index on to_tsvector in Postgres for title/tags; not global search.
- Multi-region active-active? Write to local-region S3; cross-region replication for hot data; a global metadata table with conflict resolution for the rare same-ID collision (should not happen with one ULID allocator per region plus a region prefix).
- Encryption at rest only vs client-side? Document threat model; client-side keys mean we cannot search content.
- 1000x read spike? Rely on CDN; if dynamic, use distributed cache layer and request coalescing.
- GDPR delete? Hard delete S3 + metadata + purge CDN; audit log the request id.
- LLD for chunking large files? Pre-signed multipart upload; a completion callback registers storage_key in the paste row.
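The request coalescing from the 1000x-spike answer is the single-flight pattern: concurrent cache misses for one hot key share a single origin fetch instead of stampeding S3 or the DB. A threaded sketch (this version memoizes the result for clarity; production would evict the entry once the event fires):

```python
import threading

class SingleFlight:
    """Coalesce concurrent fetches of the same key into one origin call."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event set when the fetch completes
        self._results = {}    # key -> fetched value

    def get(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            if ev is None:
                ev = self._inflight[key] = threading.Event()
                leader = True                 # first caller does the work
            else:
                leader = False
        if leader:
            self._results[key] = fetch(key)   # the single origin call
            ev.set()
        else:
            ev.wait()                         # followers reuse the result
        return self._results[key]
```

The leader's result is published before `ev.set()`, so the Event provides the happens-before edge followers need to read it safely.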