Design Typeahead Autocomplete
1. Requirements
Functional
- As the user types a query prefix in the search box, show the top-K suggestions (completed phrases or entities) within tens of milliseconds.
- Support popular completions, personalized history (optional), and light spell-tolerance (not a full spellcheck engine).
- Respect locale and safe-search filtering for suggestions.
- Debouncing occurs client-side; server receives prefixes of length >= 2 (configurable).
- Ranking by blended score: prefix match quality, global popularity, personalization signals.
Non-Functional
- Scale: 500M searches/day on the property → typeahead requests run 3–5x that as users revise prefixes → ~1.5B suggestion requests/day ≈ 17.4k/s average; ~174k/s at a 10x peak factor.
- Latency: p99 under 40 ms server-side in same region; edge closer to 20 ms with cache.
- Availability: 99.99% (degraded mode returns cached popular-only).
- Consistency: eventually consistent with index updates; trending terms lag seconds acceptable.
- Durability: suggestion corpus derived from offline logs—online index ephemeral with periodic rebuild.
Out of Scope
- Full document search results page (only suggestions).
- Voice/type multimodal input.
- Deep learning seq2seq generation of arbitrary novel phrases (focus classical retrieval + ranking).
- Multi-tenant enterprise ACL per suggestion row.
2. Back-of-Envelope Estimations
Requests: 1.5B/day ~ 17.4k/s; peak factor 10x ~ 174k/s globally.
If ~40% of requests are duplicates after normalization (many users typing the same popular prefix, e.g. “weather”), effective origin load is lower; the CDN and a merge cache at the gateway reduce it further.
Payload: ~200 B JSON per suggestion * 8 suggestions ~ 1.6 KB per response; 174k * 1.6 KB ~ 278 MB/s peak egress from suggestion tier—feasible with regional clusters + edge.
Index size: 10M distinct suggestion strings, avg 30 chars UTF-8 + scores + pointers ~ 64 B overhead → ~1 GB base trie-like structures + posting lists; 10–20 GB with replicas and n-grams.
Logging for offline rebuild: sample 1% of queries → still huge; use streaming aggregation to Hive/BigQuery daily job.
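The arithmetic above can be sanity-checked in a few lines (all constants are the assumptions stated in this section):

```python
# Sanity-check the back-of-envelope numbers from this section.
SEARCHES_PER_DAY = 500e6
PREFIXES_PER_SEARCH = 3            # users revise prefixes ~3-5x per search
SUGGEST_PER_DAY = SEARCHES_PER_DAY * PREFIXES_PER_SEARCH   # 1.5B/day

avg_qps = SUGGEST_PER_DAY / 86_400     # ~17.4k/s average
peak_qps = avg_qps * 10                # ~174k/s with a 10x peak factor

RESP_BYTES = 200 * 8                   # ~200 B per suggestion, 8 suggestions
peak_egress_mb_s = peak_qps * RESP_BYTES / 1e6   # ~278 MB/s peak egress

print(f"avg {avg_qps:,.0f}/s, peak {peak_qps:,.0f}/s, egress {peak_egress_mb_s:.0f} MB/s")
```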
3. API Design
```
GET /v1/suggest?q=weath&sessionId=sess_abc&limit=8
Accept-Language: en-US

-> 200 {
  "query": "weath",
  "suggestions": [
    { "text": "weather today", "source": "popular", "score": 0.98 },
    { "text": "weather tomorrow", "source": "popular", "score": 0.95 }
  ]
}
```

```
GET /v1/suggest/health

-> 200 { "indexVersion": 202604291200, "backend": "ok" }
```

WebSocket optional for ultra-low-latency apps:

```
WS /v1/suggest/stream
Client sends: { "prefix": "wea" }
Server pushes: { "suggestions": [...] }
```

Internal gRPC: `Suggest(SuggestRequest) returns (SuggestResponse)` with protobuf for smaller payloads.
4. Data Model
Trie / prefix tree — in-memory per shard storing compressed branches to descendants.
Inverted auxiliary: map prefix_hash -> list(sorted suggestion_id) maintained incrementally.
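A minimal sketch of that prefix → top-K map (a plain dict with a bounded heap per prefix; `PrefixIndex` and `K` are illustrative names, not from the source):

```python
import heapq
from collections import defaultdict

K = 8  # suggestions returned per prefix

class PrefixIndex:
    """Map every prefix (length >= 2) of a suggestion to its top-K entries."""
    def __init__(self):
        self._by_prefix = defaultdict(list)   # prefix -> [(score, text), ...] min-heap

    def add(self, text: str, score: float) -> None:
        norm = text.lower()
        for end in range(2, len(norm) + 1):
            bucket = self._by_prefix[norm[:end]]
            heapq.heappush(bucket, (score, text))
            if len(bucket) > K:               # keep only the K best per prefix
                heapq.heappop(bucket)         # pops the lowest score

    def suggest(self, prefix: str) -> list:
        bucket = self._by_prefix.get(prefix.lower(), [])
        return [text for _, text in sorted(bucket, reverse=True)]

idx = PrefixIndex()
idx.add("weather today", 0.98)
idx.add("weather tomorrow", 0.95)
print(idx.suggest("weath"))   # ['weather today', 'weather tomorrow']
```

Precomputing top-K at every prefix trades memory for O(1) lookups; a real service would bound prefix length (max_gram 20, as in the ES config below) to cap the blow-up.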
Elasticsearch/OpenSearch index with an edge n-gram analyzer on the suggestion_text field:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram_filter"]
        }
      },
      "filter": {
        "edge_ngram_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
      }
    }
  }
}
```

Why Elasticsearch vs a custom trie service
- ES: faster time-to-product, horizontal scaling, BM25 scoring hooks — CPU heavier per request.
- Custom trie in C++/Rust: lowest latency, highest engineering cost — pick when p99 <10 ms hard requirement.
Personalization store
- Redis sorted set `user:{id}:recent_queries`, capped at 50 entries — intersected with global candidates at rank time.
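The capped recent-history behaviour maps onto Redis sorted-set commands (`ZADD` with a timestamp score, `ZREMRANGEBYRANK` to trim the oldest). An in-memory stand-in sketching the same logic, assuming the key and cap above:

```python
import time

CAP = 50  # keep at most 50 recent queries per user

class RecentQueries:
    """In-memory stand-in for a Redis sorted set keyed user:{id}:recent_queries.

    With redis-py this would be roughly:
        r.zadd(key, {query: time.time()})
        r.zremrangebyrank(key, 0, -(CAP + 1))   # trim oldest beyond the cap
    """
    def __init__(self):
        self._scores = {}   # query -> last-seen timestamp

    def record(self, query, now=None):
        self._scores[query] = time.time() if now is None else now
        if len(self._scores) > CAP:               # evict the oldest entry
            oldest = min(self._scores, key=self._scores.get)
            del self._scores[oldest]

    def personal_boost(self, candidates):
        """Intersect global candidates with this user's recent history."""
        return {c for c in candidates if c in self._scores}

h = RecentQueries()
h.record("weather today", now=1.0)
h.record("bitcoin price", now=2.0)
print(h.personal_boost(["weather today", "weather radar"]))  # {'weather today'}
for i in range(100):                # flood with newer queries ...
    h.record(f"q{i}", now=float(10 + i))
print(len(h._scores))               # 50 -- history stays capped
```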
Analytics warehouse
- BigQuery table `raw_queries` for the nightly popularity aggregation feeding the batch index builder.
Sample popularity row (offline)
| normalized_query | impression_count_7d | last_updated |
|---|---|---|
| weather today | 12000000 | 2026-04-29 |
5. High-Level Architecture
Client (debounced input) → CDN/edge cache for anonymous popular prefixes → API gateway (coalescing, rate limits) → suggestion serving tier (sharded trie service or Elasticsearch), consulting Redis personalization at rank time. An offline pipeline (sampled query logs → BigQuery aggregation → batch index builder → S3 snapshots) feeds the serving tier.
6. Component Deep-Dives
Serving tier: trie microservice
- Sharded by first two letters of normalized query (36^2 buckets alnum) mapped via consistent hashing to pods — avoids single giant trie.
- Memory: succinct trie (DAWG / Double-array trie) vs naive pointer trie — 5–10x space savings important at 100M nodes.
- Update: blue/green load new immutable snapshot from S3 at version boundary — double buffering in process.
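The blue/green snapshot update is essentially an atomic pointer exchange; a sketch under the assumption that snapshots are immutable once loaded (`SnapshotHolder` is an illustrative name):

```python
import threading

class SnapshotHolder:
    """Double-buffered holder: readers always see a complete, immutable snapshot."""
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = initial

    def get(self):
        return self._current          # single reference read; safe for readers

    def swap(self, new_snapshot):
        # Fully load and validate new_snapshot *before* calling swap (blue/green).
        with self._lock:
            old, self._current = self._current, new_snapshot
        return old                    # caller can free / GC the old version

holder = SnapshotHolder({"version": 202604291200, "trie": ...})
old = holder.swap({"version": 202604291300, "trie": ...})
print(holder.get()["version"])   # 202604291300
```

In-flight requests keep reading the old snapshot they grabbed via `get()`; memory briefly doubles during the swap, which is the cost of the double-buffering mentioned above.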
Elasticsearch path
- Good for teams already operating ES; completion suggester API built-in but less flexible than custom ranking — combine with function_score for popularity.
Ranking fusion
- Score = `w1 * prefixScore + w2 * log(popularity) + w3 * personalBoost`.
- Optional learning-to-rank second stage: a lightweight model rescores only the top ~50 candidates.
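The fusion formula as a runnable sketch; the weights `W1..W3` and candidate numbers are illustrative, not tuned values from the source:

```python
import math

W1, W2, W3 = 1.0, 0.2, 0.5   # illustrative weights; tuned offline in practice

def fused_score(prefix_score, popularity, personal_boost):
    """Score = w1*prefixScore + w2*log(popularity) + w3*personalBoost."""
    return W1 * prefix_score + W2 * math.log(max(popularity, 1)) + W3 * personal_boost

# (text, prefix_score, popularity, personal_boost)
candidates = [
    ("weather today",    1.0, 12_000_000, 0.0),
    ("weather tomorrow", 1.0,  4_000_000, 1.0),  # in this user's recent history
]
ranked = sorted(candidates, key=lambda c: fused_score(*c[1:]), reverse=True)
print([text for text, *_ in ranked])   # personalization outweighs raw popularity here
```

Taking `log(popularity)` keeps a 12M-impression query from drowning out every personalization signal; the boost of 0.5 only needs to beat `w2 * log(12e6/4e6) ≈ 0.22` to flip the order.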
Client debouncing
- 50–100 ms debounce reduces wasteful requests 50%+ — product trade vs perceived responsiveness.
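Debouncing usually lives in client-side JavaScript, but the timer logic is language-neutral; a thread-based sketch with the 50 ms lower bound from above:

```python
import threading
import time

def debounce(wait_s):
    """Invoke the wrapped function only after wait_s of silence; earlier calls are dropped."""
    def decorator(fn):
        timer = None
        lock = threading.Lock()
        def wrapped(*args, **kwargs):
            nonlocal timer
            with lock:
                if timer is not None:
                    timer.cancel()            # a newer keystroke supersedes this one
                timer = threading.Timer(wait_s, fn, args, kwargs)
                timer.start()
        return wrapped
    return decorator

calls = []

@debounce(0.05)   # 50 ms
def send_suggest_request(prefix):
    calls.append(prefix)

for p in ["w", "we", "wea"]:    # rapid keystrokes
    send_suggest_request(p)
time.sleep(0.2)
print(calls)   # ['wea'] -- only the final prefix reached the "server"
```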
CDN caution
- Personalized responses are usually uncacheable; cache only anonymous popular prefixes at the edge (no `Vary` header) under a separate URL path `/v1/suggest/public?q=`.
Cold start
- Warm connections HTTP/2 + gRPC keep-alive from gateway to trie shards.
Why not pure DB LIKE 'prefix%'
- Relational index range scans on huge tables lag tens of ms under concurrent load — unacceptable for typeahead at scale.
7. Bottlenecks & Mitigations
| Issue | Symptom | Mitigation |
|---|---|---|
| Hot prefix "we" | Few shards overloaded | Finer sharding; rate limit per IP; serve static top-10 from edge |
| Index rebuild storm | CPU spike | Canary snapshot rollout; async replica warming |
| ES heap pressure | GC pauses | Dedicated coordinating nodes; shrink shard count |
| Thundering herd new trending term | Cache miss stampede | Request coalescing singleflight per prefix in gateway (caching) |
| Unicode normalization | Missed matches | NFC normalize + lowercase pipeline consistent offline/online |
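The singleflight mitigation from the table coalesces concurrent misses for the same prefix into one backend fetch (Go's `singleflight` package is the usual reference; this is a minimal Python analogue with illustrative names):

```python
import threading
import time

class SingleFlight:
    """Coalesce concurrent calls for the same key into one backend fetch."""
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event set when the leader finishes
        self._results = {}    # key -> last fetched result

    def do(self, key, fetch):
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:                        # we are the leader for this key
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                self._results[key] = fetch(key)
            finally:
                with self._lock:
                    del self._inflight[key]
                ev.set()
        else:
            ev.wait()                         # followers reuse the leader's result
        return self._results[key]

fetches = []
def backend(prefix):                          # stand-in for the suggestion tier
    fetches.append(prefix)
    time.sleep(0.1)                           # simulate a slow cache-miss fetch
    return [prefix + " today"]

sf = SingleFlight()
threads = [threading.Thread(target=sf.do, args=("weather", backend)) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(len(fetches))   # 1 -- ten concurrent requests, one backend call
```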
8. Tradeoffs
| Decision | Alternative | Why we picked |
|---|---|---|
| Custom trie service | ES only | Latency tail control at peak |
| Immutable snapshot deploy | Live incremental trie edits | Simplicity and predictable perf |
| Redis personal history | Only popular | Engagement lift vs privacy review |
| Edge cache public prefixes | Full dynamic | Cost and latency for anonymous |
| gRPC internal | REST | Less overhead high QPS |
| Sampled logging | Log all queries | Cost; still statistically robust trends |
9. Follow-ups (interviewer drill-downs)
- Typo tolerance? A separate SymSpell index, or ES fuzzy matching (`fuzziness: AUTO`) applied to the top-K popular terms only, as a latency guard.
- Multi-region: replicate read-only snapshots; a last-writer-wins version promotion runs every hour.
- A/B ranking models: shadow traffic compare pCTR — feature flag in fusion service.
- Children safe: blocklist trie overlay merged at rank stage.
- Relation to a news-feed design: its query box is a downstream consumer of the session context here.