Design a Large-Scale Web Search Engine
1. Requirements
Functional
- Crawl the web (respect robots, politeness), discover links, fetch pages.
- Index textual content; rank documents for a query with relevance + quality signals.
- Serve query → ranked URLs + snippets at interactive latency; support pagination and refinements.
- Continuous refresh for popular/recent pages; spam and unsafe-content demotion (high level).
- Vertical integrations (news, images) as extensions—core web search assumed.
Non-Functional
- Scale: billions of indexed documents; query QPS of 10⁵–10⁶+ at global leaders; sustained crawl attempts on the order of billions to tens of billions of pages/day at planet scale (the subset actually fetched is policy-driven).
- Latency: p99 ~200–400 ms for web results over median networks—heavy caching for head queries.
- Availability: 99.99% for query path; crawl pipeline may lag without blocking reads.
- Consistency: eventual between crawl, index, and serve; no cross-query ACID requirement (CAP theorem).
- Durability: raw crawl store and index generations durable; multiple checkpoints.
Out of Scope
- Full ads auction and quality-based pricing.
- Legal jurisdiction for delisting (policy hook only).
- AGI answer synthesis—classical ten blue links + snippets focus.
2. Back-of-Envelope Estimations
- Corpus: assume 50B indexed pages with a ~20 KB compressed text extract each → ~1 PB index footprint as an order of magnitude before replication; link-graph structures add hundreds of TB.
- Queries: ~100k QPS global average, 1M+ at peak; head-query cache hit rates of 30–60% depending on locale—warrants a massive CDN-like front cache for suggestions and results.
- Crawl: 30T+ discovered URLs exist; the fetch budget limits actual retrieval—politeness caps per host are often single-digit QPS—which drives a massive distributed frontier.
- Bandwidth: crawl egress from the web is heterogeneous; internally, PBs/day move between pipeline stages.
- Storage growth: incremental indexing reduces full-rebuild frequency, but hourly/daily ETL jobs still run for some tiers.
- RAM: query-serving replicas cache posting lists for hot terms—tens of TB aggregate across clusters (see caching).
- Snippet generation: storing positional indices adds ~30–50% to inverted index size but enables bold highlights—trade memory vs. snippet quality per tier.
- Query fan-out: if the mean query length is 2.5 terms and each term touches 3 posting-list shards, expect 7–8 internal RPCs before ranking—budget connection pools and hedged requests accordingly (API gateway timeouts must exceed shard p99, not the mean). A rough calculation is sketched below.
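The arithmetic above in runnable form—a minimal sketch where every constant is an assumed figure from this section, not a measurement:

```python
# Back-of-envelope sketch (illustrative numbers from the estimates above).
PAGES = 50e9                 # indexed documents
EXTRACT_BYTES = 20 * 1024    # compressed text extract per page

index_bytes = PAGES * EXTRACT_BYTES
print(f"index footprint ~ {index_bytes / 1e15:.1f} PB before replication")

QPS_PEAK = 1_000_000
CACHE_HIT = 0.45             # assumed mid-point of the 30-60% head-query hit rate
print(f"backend QPS at peak ~ {QPS_PEAK * (1 - CACHE_HIT):,.0f}")

MEAN_TERMS, SHARDS_PER_TERM = 2.5, 3
print(f"internal RPCs per query ~ {MEAN_TERMS * SHARDS_PER_TERM:.1f}")
```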
3. API Design
GET /v1/search?q=system+design&start=0&num=10&safe=active
-> 200 {
results: [{ url, title, snippet, rank }],
totalApproximate: 12345678,
spellSuggestion?: string
}
GET /v1/instant?q=sys
-> 200 { completions: [{ text, score }] }
POST /v1/debug/fetch-info (internal)
Body: { url }
-> 200 { lastCrawled, robotsAllowed }
Errors: 400 query too long, 429 suspicious traffic (rate limiter).
GET /v1/search/news?q=&from=
-> 200 { items: [{ url, title, publishedAt }] }
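For concreteness, a minimal client-side sketch of the main search endpoint and its offset-based pagination; the host name and latency budget are placeholders, not part of the spec:

```python
import requests

def search(query: str, page: int = 0, page_size: int = 10, safe: str = "active"):
    """Minimal consumer of GET /v1/search; `start` is an offset, so page N
    begins at N * page_size (field names mirror the spec above)."""
    resp = requests.get(
        "https://search.example.com/v1/search",   # placeholder host
        params={"q": query, "start": page * page_size, "num": page_size, "safe": safe},
        timeout=0.5,                              # interactive-latency budget (assumed)
    )
    resp.raise_for_status()
    return resp.json()["results"]
```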
4. Data Model
- Document: docId (64-bit), url, checksum, crawlTimestamp, language, PageRank-like score materialized or computed offline.
- Inverted index: term → posting list of (docId, positions, tf), compressed (var-byte, PFOR).
- Forward index: docId → token stream for snippets (optional partial store).
- Link graph: sparse edge lists for PageRank / SALSA-class features, computed offline.
Storage: GFS/S3 for WARC-like blobs; Bigtable/Cassandra for serving shards; batch MapReduce/Spark/Beam for graph and ML features. A relational DB cannot hold a web-scale inverted index—use sharded inverted files (sharding). A toy var-byte codec is sketched below.
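To make the compression choice concrete, a toy var-byte codec over docId gaps—a minimal sketch; real indexes use block formats such as PFOR, and the byte layout here is just one illustrative convention:

```python
def varbyte_encode(gaps):
    """Encode positive integers (docId gaps) as variable-length bytes.
    Low 7 bits carry payload; the high bit marks the final byte of each number."""
    out = bytearray()
    for n in gaps:
        while True:
            if n < 128:
                out.append(n | 0x80)   # terminator byte
                break
            out.append(n & 0x7F)
            n >>= 7
    return bytes(out)

def varbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

# Posting lists store gaps between sorted docIds, which keeps the numbers small.
doc_ids = [3, 7, 21, 10_005]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
assert varbyte_decode(varbyte_encode(gaps)) == gaps
```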
5. High-Level Architecture
Query path: a root aggregator fans out to retrieval shards (term partitions), intersects or unions posting lists per the query's boolean semantics, and prunes via a top-k heap; the ranking mixer then adds hundreds of features. Crawl is a continuous loop separate from the user path; pub/sub-style queues connect the stages. A fan-out-and-merge sketch follows below.
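A minimal sketch of the fan-out-and-merge step; the shard client interface is a hypothetical stand-in for the real RPC stub:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fan_out_and_merge(shard_clients, terms, k=10):
    """Fan a query out to retrieval shards in parallel and keep the global top-k.
    Each shard client is any callable (terms, k) -> list of (score, doc_id),
    standing in for the real RPC stub (hypothetical interface)."""
    with ThreadPoolExecutor(max_workers=len(shard_clients)) as pool:
        partials = pool.map(lambda client: client(terms, k), shard_clients)
    # Each shard already caps its result at k, so the global merge is cheap: O(N log k).
    return heapq.nlargest(k, (hit for part in partials for hit in part))

# Usage with two fake shards; a real mixer would rerank these candidates next.
shard_a = lambda terms, k: [(0.92, 101), (0.40, 102)]
shard_b = lambda terms, k: [(0.88, 201)]
print(fan_out_and_merge([shard_a, shard_b], ["system", "design"], k=2))
# -> [(0.92, 101), (0.88, 201)]
```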
6. Component Deep-Dives
Crawl and URL frontier details are covered in the companion web-crawler case; here we focus on serving and index structure.
- Segmented inverted index: posting lists are append-friendly; background merges compact small segments into larger immutable files so read amplification stays bounded (LSM-style), tuned for boolean AND over text terms. Block-decompressed iterators are cache-friendly; see caching.
- Scoring at retrieval vs. rerank: BM25-style scores can be computed while walking postings to avoid a second pass (see the sketch after this list); heavier neural or tree-ensemble rerankers run only on the top few hundred candidates so p99 stays bounded.
- Phrase and proximity: positional postings or auxiliary next-word structures cost space; a two-pass phrase filter (narrow candidates, then verify positions) is a common cost knob.
- Snippets: after doc candidates are fixed, pull windows from a forward store or inverted positions; cache snippet text for head queries to save CPU.
- Freshness: tiered crawl schedules feed delta segments; the ranking mixer can boost recency when intent classifiers label a query as news-like.
- Failure: promote replicas for serving shards; if the ranker overloads, degrade to retrieval-only with a simpler static rank rather than failing the whole query path.
- Indexer backpressure: crawl-hot domains can starve segment builders—coordinate link scheduling with indexer throughput (see web-crawler) so trending topics still get new segments without violating politeness.
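As referenced above, a minimal single-pass BM25 sketch: one term's contribution is accumulated while its posting list is walked, so no second pass over candidates is needed. Parameter defaults and the in-memory dicts are illustrative; a production shard would accumulate into a bounded top-k heap instead.

```python
import math

def bm25_accumulate(postings, doc_len, avg_dl, n_docs, scores, k1=1.2, b=0.75):
    """Add one term's BM25 contribution to `scores` while walking its posting list.
    postings: list of (doc_id, tf); doc_len: doc_id -> document length in tokens."""
    df = len(postings)                                   # document frequency of the term
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    for doc_id, tf in postings:
        norm = k1 * (1 - b + b * doc_len[doc_id] / avg_dl)
        scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1) / (tf + norm)

# Usage: accumulate per query term, then hand the best few hundred docs to the reranker.
scores, doc_len = {}, {1: 120, 2: 300}
bm25_accumulate([(1, 3), (2, 1)], doc_len, avg_dl=200, n_docs=50_000_000_000, scores=scores)
top_candidates = sorted(scores, key=scores.get, reverse=True)[:300]
```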
7. Bottlenecks & Mitigations
- Hot queries: multilayer cache; precomputed instant answers for weather-like intents (optional).
- Celebrity spike: rate-limit suggestions; isolate shard overload by shedding tail terms.
- Crawl-hot domains: strict per-host throttle; distributed fetchers with shared token buckets (a minimal sketch follows after this list).
- Index merge storms: throttle compaction windows; read from older segments while merging.
- SafeSearch / region: classifier scores affect a ranking multiplier—cache the per-query intent bucket to avoid duplicate inference calls across shards.
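A minimal sketch of the per-host token-bucket throttle; state is in-process here for illustration, whereas a real fetcher fleet would keep the bucket in a shared store:

```python
import time

class HostTokenBucket:
    """Per-host politeness throttle (illustrative; a distributed deployment would
    hold the token state in a shared store such as Redis rather than in-process)."""
    def __init__(self, rate_per_s=2.0, burst=4):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = burst, time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True       # caller may fetch one URL from this host
        return False          # caller re-queues the URL instead of hammering the host
```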
8. Tradeoffs
| Decision | Alternative | Why we picked |
|---|---|---|
| Inverted index | Full scan | Latency impossible at web scale |
| Approximate totals | Exact counts | Costly for huge corpora |
| Batch PageRank | Online only | Link signals need global passes |
| Eventually fresh index | Strong consistency | Crawl lag inherent |
9. Follow-ups (interviewer drill-downs)
- 100× query load? More cache; reduce ranking complexity under pressure; regional isolation.
- Exactly-once indexing? Idempotent doc versions keyed by content hash (idempotency); a toy sketch follows after this list.
- Index migration? Blue/green index generations; shadow-traffic validation.
- Multi-region? Partition shards geographically; sticky query routing for cache warmth.
- Cost? Cheap cold storage for raw crawl; prune low-value URLs early; GPU ranking only for the tail after filtering.
- Query understanding? Spell correction and synonym expansion can blow up posting-list unions—cache the rewritten query AST and cap max OR clauses per tier.
- Personalization? Click history feeds user embeddings; must not leak profiles across shared devices—session isolation and opt-out flags in the profile store.
- Synonym explosion? Hand-curated synsets help recall but can poison precision—gate expansion behind query-classifier confidence and log exploded queries for offline review.
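To make the content-hash idempotency point concrete, a toy sketch; the store and indexer interfaces are stand-ins, not a real pipeline API:

```python
import hashlib

def index_document(store: dict, indexer, url: str, body: bytes) -> bool:
    """Idempotent indexing sketch: doc versions are keyed by a content hash, so
    replaying a crawl message for unchanged content is a no-op. `store` maps
    url -> last indexed checksum; `indexer` is any callable downstream stage."""
    checksum = hashlib.sha256(body).hexdigest()
    if store.get(url) == checksum:
        return False                      # duplicate delivery or unchanged page
    indexer(url, body, version=checksum)  # downstream write, idempotent per (url, version)
    store[url] = checksum                 # in production: atomic compare-and-set
    return True

# Usage: replaying the same message twice only indexes once.
seen, calls = {}, []
index_document(seen, lambda u, b, version: calls.append(version), "https://a.example", b"hello")
index_document(seen, lambda u, b, version: calls.append(version), "https://a.example", b"hello")
assert len(calls) == 1
```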