THN Interview Prep

Bloom Filter

What it is

A Bloom filter is a probabilistic set membership structure: a bit array and several hash functions map each element to bit positions. If all those bits are set, the element is maybe in the set; if any bit is clear, the element is definitely not in the set.

Bloom filter membership check showing hash functions setting bits, definite negative answers, maybe positive verification, and false positive risk.

How the check works

  1. On insert, hash the key k different ways and set those bit positions to 1.
  2. On lookup, hash the key the same k ways and inspect the same positions.
  3. If any position is 0, the key was never inserted because insertion would have set every required bit.
  4. If all positions are 1, the key may be present, but those bits could have been set by other keys colliding across the array.

That is the whole one-sided contract: a clear bit proves absence; all set bits only create a candidate that may need verification.

False positive rate

  • No false negatives (if implemented correctly for insertions): if the filter says “not present,” it is not.
  • False positives are possible: the filter may say “maybe present” when the key was never inserted. False positive rate decreases with a larger bit array and a hash count tuned to the expected element count.

Too few hashes leave information unused; too many hashes set too many bits and make the filter saturate. A common approximation:

n = expected items
m = bits in the filter
k = number of hash functions

falsePositiveRate ~= (1 - e^(-k*n/m))^k
best k ~= (m/n) * ln(2)

Example: if n = 10M keys and you can spend about 10 bits per key, m = 100M bits (12.5 MB) and the optimal hash count is about 7. That gives roughly a 0.8% false-positive rate before checking the authoritative store.

If the target false-positive rate is known first, size memory with:

m ~= -(n * ln(p)) / (ln(2)^2)
k ~= (m/n) * ln(2)

Example: for 100M keys at 1% false positives, m is about 958M bits (120 MB) and k is about 7. For 0.1%, memory rises to about 180 MB. Lower false positives are not free; they buy fewer extra store lookups with more memory and CPU per query.

Use cases

  • Pre-check before expensive lookup: disk/SSD existence, caching layer “maybe in group,” Cassandra and HBase block filters to skip IO.
  • Distributed joins: send a compact filter for one side of a join so another service or shard can skip keys that definitely cannot match.
  • Crawler “seen URL” checks at huge scale with bounded memory.
  • One-sided cheap negative answers to save work.

Concrete example

Suppose a crawler has seen 10B URLs and cannot keep the exact set fully in memory on every worker. A Bloom filter lets each worker identify URLs that are definitely not in the seen set and send them to the authoritative store for insert/enqueue. When the filter says maybe present, the worker checks the authoritative URL store before dropping the URL, because a false positive would otherwise lose crawl coverage.

Classic Bloom filter: no delete

Standard Bloom filters do not support reliable delete without risking false negatives. Counting Bloom filters use counters per slot to allow delete at higher memory cost. Cuckoo filters are an alternative with delete support and good space efficiency in some workloads.

  insert(key):  for each hash: bit[hash(key, index)] = 1
  maybePresent(key):  all those bits are 1  ->  maybe
                        any bit 0            ->  definitely no

When to use

  • Huge universe of keys; acceptable occasional extra positives; need tiny memory vs exact set.

Alternatives

  • Hash set in memory: exact, but memory grows with every key.
  • Sketches (Count-Min, HyperLogLog) for different questions (frequency, cardinality).

Data structure comparison

StructureAnswersErrorsUse when
Hash setExact membershipNone if data fits and is availableCorrectness needs exact answers and memory is acceptable
Bloom filterProbable membershipFalse positives, no false negatives for insertsCheap negatives save expensive IO or network calls
Counting Bloom filterProbable membership with deleteFalse positives plus counter overflow riskDeletes are needed and extra memory is acceptable
Cuckoo filterProbable membership with deleteFalse positives; insertion can fail near high loadYou need delete with compact storage and manageable rebuilds
HyperLogLogApproximate cardinalityEstimation errorYou need "how many unique?", not "is this key present?"
Count-Min SketchApproximate frequencyOverestimates countsYou need "how often did this key appear?", not exact membership

Simpler example

In an LSM database or cache-heavy service, a Bloom filter can sit before a disk segment or remote store. If the filter says definitely not present, skip the expensive lookup. If it says maybe present, perform the real lookup and accept that some of those lookups will be wasted false positives.

Failure modes

  • Overfilled filter (too many keys for bit array size): FP rate becomes useless.
  • Assuming no FP in correctness-critical paths without verification step.
  • Delete on classic filter breaks correctness.
  • Wrong sizing assumption: expected cardinality grows but the filter is not rebuilt or rotated.
  • Poor hash independence: correlated hashes raise false positives beyond the estimate.

Common mistakes

  • Using a Bloom filter as the source of truth instead of a pre-check.
  • Forgetting the rebuild/rotation plan when the key population grows.
  • Applying it where false positives change correctness rather than only adding extra work.
  • Ignoring serialization/versioning when filters are shipped between services.

Operational signals

  • Estimated fill ratio, measured false-positive rate from sampled authoritative checks, lookup QPS saved, and authoritative lookup fallback rate.
  • Filter build time, distribution lag, memory footprint, and version mismatch across nodes.

Interview talking points

  • Always pair Bloom pre-check with authoritative store lookup on positives.
  • Explain hashes and bits; relate FP to array size and hash count; estimate memory with back-of-envelope.

Interview answer shape

  1. State the problem: exact set membership is too expensive in memory or IO, and false positives are acceptable.
  2. Explain the one-sided contract: definitely not present is final; maybe present must be verified if correctness matters.
  3. Size it from expected cardinality, memory budget, and target false-positive rate.
  4. Discuss lifecycle: rebuild/rotate when overfilled, distribute versions safely, and use counting Bloom or Cuckoo filters if delete is required.
  5. Connect to system impact: fewer disk/store lookups, lower read amplification, but extra positives still hit the authoritative store.

Common follow-ups: why there are no false negatives, how false positives happen, how to choose m and k, delete support, overfilling, hash quality, and Bloom filter vs hash set vs HyperLogLog.

Follow-up answers

  • Why no false negatives? Insert sets every bit that lookup will later check for that key. If one checked bit is 0, no successful insert for that key happened in this filter version.
  • How false positives happen: different inserted keys can collectively set all positions for a never-inserted key, so lookup sees all 1s even though the exact set does not contain it.
  • How to choose m and k: start with expected item count n and acceptable false-positive rate p, compute memory m, then choose k ~= (m/n) * ln(2). Leave headroom or rotate before the filter exceeds planned cardinality.
  • Why delete is unsafe in classic Bloom filters: clearing a bit for one deleted key may clear a bit needed by another inserted key, creating a false negative.
  • When not to use it: do not use a Bloom filter as the final authority for permissions, payments, idempotency, or any path where a false positive changes correctness instead of only adding extra verification work.

Interview questions

  1. Why can a Bloom filter have false positives but not false negatives for inserted keys?
  2. Given n expected keys and a 1% false-positive target, how would you estimate memory and hash count?
  3. Why is a Bloom filter unsafe as the only check before dropping a URL, payment id, or permission record?

Memory hooks

  • not present is final; maybe present needs verification when correctness matters.
  • Bigger m reduces false positives; too many hashes can saturate the filter.
  • Bloom filters answer membership, not counts or frequencies.

Mark this page when you finish learning it.

Last updated on

Spotted something unclear or wrong on this page?

On this page