Design a Distributed Web Crawler
1. Requirements
Functional
- Accept seed URLs and continuously discover new URLs via extracted links.
- Fetch HTTP(S) resources; respect robots.txt and crawl-delay semantics per host.
- Parse HTML (and a limited set of other types) to emit outlinks and text extracts for the indexing pipeline.
- Dedupe URLs and near-duplicate content; maintain per-host politeness schedules.
- Handle failures: timeouts, retries with backoff, DNS failures; rate limit abusive hosts.
Non-Functional
- Scale: billions of URL fetch attempts/day at large engines; millions of parallel fetches across clusters, subject to politeness.
- Throughput: maximize goodput under per-host constraints—not raw QPS alone.
- Availability: 99.99% for orchestration; individual fetch failures are normal and absorbed.
- Consistency: eventual visibility of new URLs in the frontier; at-least-once processing with idempotent sinks (idempotency).
- Durability: crawl frontier checkpoints; optional raw content WARC archive for replay.
Out of Scope
- Full browser rendering for heavy SPAs (mention headless farm extension).
- Legal compliance program per jurisdiction beyond robots policy hooks.
- Search ranking—only emit documents downstream.
2. Back-of-Envelope Estimations
Assume 50B unique URLs discovered over the corpus; 10% churn/month requires refresh scheduling.
- Fetch attempts: 100M fetches/hour is typical for a large system → ~28k URL fetches/s aggregate—but per-host caps mean queue depth is huge.
- Page size: 500 KB average download (heavy tail); 50 TB/hour raw if unconstrained—actual volume is limited by politeness and budget → single-digit TB/hour is typical for a large crawl.
- Frontier storage: URL queue with scores—billions of rows—distributed priority queues sharded by host hash.
- Dedup: 64-bit checksum of the normalized URL per shard—100s of GB of Bloom structures (bloom filter) with periodic rebuild.
- Parser CPU: thousands of cores running continuously; IO wait dominates the fetchers.
- DNS budget: if the average fetch does 2 DNS lookups (CNAME chain), 100M fetches/hour → 200M lookups/hour—a shared resolver farm with negative caching is mandatory.
- Politeness math: a host with crawl-delay 1s caps at a theoretical maximum of 86,400 fetches/day—if you must refresh 1M URLs under that host, you need ~12 days unless rules change or you negotiate higher throughput—frontier prioritization cannot violate published constraints without policy exceptions.
- Redirect chains: 301/302 cascades multiply fetch latency and can explode depth—cap redirects per URL (for example 5) and treat loops as permanent failures with DLQ inspection. The sketch after this list reproduces the key numbers.
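A quick sanity check of the arithmetic above, in Python; all inputs are the assumptions already stated in this list:

```python
# Back-of-envelope sanity check for the estimates above.
fetches_per_hour = 100e6
print(f"Aggregate fetch rate: {fetches_per_hour / 3600:,.0f} fetches/s")   # ~27,778

avg_page_bytes = 500 * 1024
tb_per_hour = fetches_per_hour * avg_page_bytes / 1e12
print(f"Unconstrained download volume: {tb_per_hour:,.0f} TB/hour")        # ~51 TB/hour

dns_lookups_per_fetch = 2
print(f"DNS lookups/hour: {fetches_per_hour * dns_lookups_per_fetch:.0e}") # 2e+08

crawl_delay_s = 1
fetches_per_day = 86_400 // crawl_delay_s
urls_to_refresh = 1_000_000
print(f"Days to refresh 1M URLs at crawl-delay {crawl_delay_s}s: "
      f"{urls_to_refresh / fetches_per_day:.1f}")                          # ~11.6 days
```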
3. API Design
Internal service boundaries:
POST /v1/frontier/enqueue
Body: { url, priority, discoveredFrom }
-> 202 { queueShard }
GET /v1/fetch/next?workerId=
-> 200 { url, allowedAfterEpochMs, headersPolicy }
POST /v1/fetch/result
Body: { url, statusCode, contentSha256, fetchedAt, links[] }
-> 204
GET /v1/robots/cache?host=
-> 200 { rulesBlob, expiresAt }
Errors: 409 duplicate suppressed, 429 worker misbehaving.
POST /v1/frontier/reprioritize
Body: { url, newPriority, reason }
-> 200 { accepted: true }
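To make the contract concrete, here is a minimal fetch-worker loop against the endpoints above; the base URL, worker id, and the bare-bones error handling are assumptions, not part of the spec:

```python
# Minimal fetch-worker loop against the internal frontier API sketched above.
# The service address and worker id are illustrative assumptions.
import time
import hashlib
import requests

BASE = "http://frontier.internal"  # hypothetical service address
WORKER_ID = "worker-42"

while True:
    lease = requests.get(f"{BASE}/v1/fetch/next", params={"workerId": WORKER_ID}).json()
    # Honor the politeness schedule computed by the frontier.
    wait_ms = lease["allowedAfterEpochMs"] - time.time() * 1000
    if wait_ms > 0:
        time.sleep(wait_ms / 1000)
    resp = requests.get(lease["url"], timeout=30)
    requests.post(f"{BASE}/v1/fetch/result", json={
        "url": lease["url"],
        "statusCode": resp.status_code,
        "contentSha256": hashlib.sha256(resp.content).hexdigest(),
        "fetchedAt": int(time.time() * 1000),
        "links": [],  # the parser fleet extracts outlinks downstream
    })
```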
4. Data Model
- UrlRecord: normalized URL, host, path, firstSeen, lastCrawled, nextFetchAfter, changeFrequencyScore.
- HostPolicy: robots rules snapshot id, maxConcurrent, minDelayMs.
- FetchResult: url, httpStatus, sha256, size, redirectChain.
Frontier: Kafka partitions keyed by host for natural politeness batching, or Redis streams + Cassandra persistence; a combination of the two is common. Object storage for raw blobs. See message queues.
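A sketch of the host-keyed sharding this implies; the shard count and hash choice are illustrative assumptions, and the point is that every URL from a given host routes to the same partition:

```python
# Route a URL to a frontier partition by hashing its host, so all URLs for a
# host are consumed by one worker group and politeness is enforced locally.
import hashlib
from urllib.parse import urlsplit

NUM_PARTITIONS = 1024  # illustrative shard count

def frontier_partition(url: str) -> int:
    host = urlsplit(url).hostname or ""
    # Stable hash (unlike Python's built-in hash(), which is randomized per process).
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

assert frontier_partition("https://example.com/a") == frontier_partition("http://example.com/b?x=1")
```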
5. High-Level Architecture
Key pattern: partition by host end-to-end, so a worker pulls from a single host partition and can respect that host's delay. A DNS caching layer avoids resolver storms. The parser emits extracted links back to the frontier. Align with load balancing ideas at the worker-assignment layer.
6. Component Deep-Dives
- URL normalization: lowercase the host, strip fragments, canonicalize default ports, track redirect chains to a final URL id (see the normalization sketch after this list).
- Robots: cache both negative and positive results; refresh on TTL; parse wildcard rules; when rules are ambiguous, the conservative failure policy is to deny the fetch.
- Politeness: token bucket per IP and per host; honor Crawl-delay; add jitter to avoid synchronized waves (see the scheduling sketch after this list).
- Dedup: simhash for near-duplicate pages; exact hash for byte-identical content.
- Retries: exponential backoff; classify 4xx vs 5xx; apply permanent ban patterns for infinite traps.
- Failure: poison URL queues → DLQ with manual review tooling.
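A minimal normalization sketch in Python, assuming urllib's parsing is acceptable; real crawlers add many more rules (trailing slashes, query-parameter ordering, percent-encoding):

```python
# URL normalization: collapse trivially-different spellings of the same URL
# into one canonical form before dedup and frontier insertion.
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url: str) -> str:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme default.
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Strip the fragment entirely; it never reaches the server.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

assert normalize("HTTPS://Example.COM:443/a#frag") == "https://example.com/a"
```

And a simplified per-host scheduler: a single-token variant of the token bucket described above, returning the earliest epoch-ms at which a worker may fetch. The in-memory state and the 10% jitter factor are assumptions; a real frontier would persist this per host.

```python
# Per-host politeness: earliest-next-fetch scheduling with jitter.
import random
import time

class HostSchedule:
    def __init__(self, min_delay_ms: int):
        self.min_delay_ms = min_delay_ms  # from robots Crawl-delay or a default
        self.next_allowed_ms = 0.0

    def reserve(self) -> float:
        """Return the epoch-ms at which the caller may fetch this host."""
        now_ms = time.time() * 1000
        start = max(now_ms, self.next_allowed_ms)
        # Jitter spreads workers so hosts don't see synchronized waves.
        jitter = random.uniform(0, 0.1 * self.min_delay_ms)
        self.next_allowed_ms = start + self.min_delay_ms + jitter
        return start

schedule = HostSchedule(min_delay_ms=1000)
t1, t2 = schedule.reserve(), schedule.reserve()
assert t2 - t1 >= 1000  # second fetch waits at least one crawl-delay
```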
7. Bottlenecks & Mitigations
- Hot domains: many links point to the same host—accrue priority without starving the long tail; per-host fair scheduling (see the sketch after this list).
- DNS overload: resolver pool with caching; negative-cache NXDOMAIN.
- Parser CPU: autoscale the parser fleet separately from the fetchers; skip non-text giants early via HEAD rules where safe.
- Frontier monolith: shard by host hash; avoid global locks.
- JavaScript-heavy sites: a headless browser farm is ~100× costlier per URL—use selective rendering only when signals suggest client-rendered content dominates.
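One way to implement the per-host fair scheduling mentioned above is round-robin over per-host queues, skipping any host whose politeness window has not yet opened; the in-memory structures here are illustrative:

```python
# Fair frontier scheduling sketch: rotate across per-host queues and skip any
# host whose politeness window is still closed, so one hot host cannot
# monopolize workers.
import time
from collections import deque

class FairScheduler:
    def __init__(self):
        self.hosts = deque()            # round-robin order of hosts
        self.queues = {}                # host -> deque of (url, delay) entries
        self.next_allowed = {}          # host -> epoch seconds

    def enqueue(self, host: str, url: str, min_delay_s: float = 1.0):
        if host not in self.queues:
            self.queues[host] = deque()
            self.next_allowed[host] = 0.0
            self.hosts.append(host)
        self.queues[host].append((url, min_delay_s))

    def dequeue(self):
        now = time.time()
        for _ in range(len(self.hosts)):
            host = self.hosts[0]
            self.hosts.rotate(-1)       # move host to the back either way
            if self.queues[host] and self.next_allowed[host] <= now:
                url, delay = self.queues[host].popleft()
                self.next_allowed[host] = now + delay
                return url
        return None                     # nothing eligible right now
```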
8. Tradeoffs
| Decision | Alternative | Why we picked it |
|---|---|---|
| Host-partitioned queues | Global priority heap | Politeness at scale |
| At-least-once delivery | Exactly-once pipe | Network realities |
| Bloom seen-url filter | Full DB membership | Memory vs FP rate |
| Raw blob archive | Parse-only | Debug and reindex |
9. Follow-ups (interviewer drill-downs)
- 100× scale? More regions; independent crawl tiers (fresh vs deep); sampling low-value web neighborhoods.
- Exactly-once commits to index? Use idempotent contentId sinks (see the sketch after this list).
- Migration? Dual frontier writers; compare dequeue rates; cut over per partition.
- Multi-region? Assign URL ranges to regions; avoid duplicate fetches via a global URL→region mapping cache.
- Cost? Drop marginal domains; compress WARC; spot instances for batch fetch waves.
- Sitemap priority? XML sitemaps inject URLs with priority hints—blend them with the PageRank signal in the frontier score; beware spam sitemaps from compromised hosts.
- Internationalization? hreflang clusters should crawl together to avoid duplicate locale variants without canonical awareness—group URL normalizer rules per site template.
- Robots edge cases? When robots.txt 404s, conservative policies either allow or deny everything—pick one explicitly and document it, because misconfigured sites swing billions of URLs.
- TLS everywhere? Mixed-content pages still appear—upgrade rules and HSTS preloading change effective URLs; the canonicalizer must track protocol upgrades or risk duplicate clusters.
- Dynamic IP hosts? Some domains map to CDNs that rotate IPs—bind politeness to hostname identity, not IP, or shared IPs cause accidental cross-host throttling.
- Storage lifecycle? Raw WARC retention cannot grow unbounded—tier to Glacier after N days, with legal exceptions for regulated segments only.
- Metrics that lie? Crawl rate without the 200 ratio misleads—always chart status histograms and mean bytes per host to spot soft-404 farms wasting budget.
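For the exactly-once follow-up, a minimal illustration of an idempotent sink keyed by content hash; sqlite3 stands in for the real index store, and the table schema is an assumption:

```python
# Idempotent index sink: commits are keyed by contentSha256, so at-least-once
# delivery from the pipeline cannot create duplicate index entries.
import sqlite3  # stand-in for the real index store

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (content_sha256 TEXT PRIMARY KEY, url TEXT, body TEXT)")

def commit_document(sha256: str, url: str, body: str):
    # INSERT OR IGNORE makes redelivery a no-op instead of a duplicate.
    db.execute(
        "INSERT OR IGNORE INTO documents (content_sha256, url, body) VALUES (?, ?, ?)",
        (sha256, url, body),
    )
    db.commit()

commit_document("abc123", "https://example.com", "hello")
commit_document("abc123", "https://example.com", "hello")  # redelivered; ignored
assert db.execute("SELECT COUNT(*) FROM documents").fetchone()[0] == 1
```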