Bloom Filter

What it is

A Bloom filter is a probabilistic set membership structure: a bit array and several hash functions map each element to bit positions. If all those bits are set, the element is maybe in the set; if any bit is clear, the element is definitely not in the set.

Bloom filter membership check showing hash functions setting bits, definite negative answers, maybe positive verification, and false positive risk.

How the check works

On insert, hash the key k different ways and set those bit positions to 1.
On lookup, hash the key the same k ways and inspect the same positions.
If any position is 0, the key was never inserted because insertion would have set every required bit.
If all positions are 1, the key may be present, but those bits could have been set by other keys colliding across the array.

That is the whole one-sided contract: a clear bit proves absence; all set bits only create a candidate that may need verification.

False positive rate

No false negatives (if implemented correctly for insertions): if the filter says “not present,” it is not.
False positives are possible: the filter may say “maybe present” when the key was never inserted. False positive rate decreases with a larger bit array and a hash count tuned to the expected element count.

Too few hashes leave information unused; too many hashes set too many bits and make the filter saturate. A common approximation:

n = expected items
m = bits in the filter
k = number of hash functions

falsePositiveRate ~= (1 - e^(-k*n/m))^k
best k ~= (m/n) * ln(2)

Example: if n = 10M keys and you can spend about 10 bits per key, m = 100M bits (12.5 MB) and the optimal hash count is about 7. That gives roughly a 0.8% false-positive rate before checking the authoritative store.

If the target false-positive rate is known first, size memory with:

m ~= -(n * ln(p)) / (ln(2)^2)
k ~= (m/n) * ln(2)

Example: for 100M keys at 1% false positives, m is about 958M bits (120 MB) and k is about 7. For 0.1%, memory rises to about 180 MB. Lower false positives are not free; they buy fewer extra store lookups with more memory and CPU per query.

Use cases

Pre-check before expensive lookup: disk/SSD existence, caching layer “maybe in group,” Cassandra and HBase block filters to skip IO.
Distributed joins: send a compact filter for one side of a join so another service or shard can skip keys that definitely cannot match.
Crawler “seen URL” checks at huge scale with bounded memory.
One-sided cheap negative answers to save work.

Concrete example

Suppose a crawler has seen 10B URLs and cannot keep the exact set fully in memory on every worker. A Bloom filter lets each worker identify URLs that are definitely not in the seen set and send them to the authoritative store for insert/enqueue. When the filter says maybe present, the worker checks the authoritative URL store before dropping the URL, because a false positive would otherwise lose crawl coverage.

Classic Bloom filter: no delete

Standard Bloom filters do not support reliable delete without risking false negatives. Counting Bloom filters use counters per slot to allow delete at higher memory cost. Cuckoo filters are an alternative with delete support and good space efficiency in some workloads.

  insert(key):  for each hash: bit[hash(key, index)] = 1
  maybePresent(key):  all those bits are 1  ->  maybe
                        any bit 0            ->  definitely no

When to use

Huge universe of keys; acceptable occasional extra positives; need tiny memory vs exact set.

Alternatives

Hash set in memory: exact, but memory grows with every key.
Sketches (Count-Min, HyperLogLog) for different questions (frequency, cardinality).

Data structure comparison

Structure	Answers	Errors	Use when
Hash set	Exact membership	None if data fits and is available	Correctness needs exact answers and memory is acceptable
Bloom filter	Probable membership	False positives, no false negatives for inserts	Cheap negatives save expensive IO or network calls
Counting Bloom filter	Probable membership with delete	False positives plus counter overflow risk	Deletes are needed and extra memory is acceptable
Cuckoo filter	Probable membership with delete	False positives; insertion can fail near high load	You need delete with compact storage and manageable rebuilds
HyperLogLog	Approximate cardinality	Estimation error	You need "how many unique?", not "is this key present?"
Count-Min Sketch	Approximate frequency	Overestimates counts	You need "how often did this key appear?", not exact membership

Simpler example

In an LSM database or cache-heavy service, a Bloom filter can sit before a disk segment or remote store. If the filter says definitely not present, skip the expensive lookup. If it says maybe present, perform the real lookup and accept that some of those lookups will be wasted false positives.

Failure modes

Overfilled filter (too many keys for bit array size): FP rate becomes useless.
Assuming no FP in correctness-critical paths without verification step.
Delete on classic filter breaks correctness.
Wrong sizing assumption: expected cardinality grows but the filter is not rebuilt or rotated.
Poor hash independence: correlated hashes raise false positives beyond the estimate.

Common mistakes

Using a Bloom filter as the source of truth instead of a pre-check.
Forgetting the rebuild/rotation plan when the key population grows.
Applying it where false positives change correctness rather than only adding extra work.
Ignoring serialization/versioning when filters are shipped between services.

Operational signals

Estimated fill ratio, measured false-positive rate from sampled authoritative checks, lookup QPS saved, and authoritative lookup fallback rate.
Filter build time, distribution lag, memory footprint, and version mismatch across nodes.

Interview talking points

Always pair Bloom pre-check with authoritative store lookup on positives.
Explain hashes and bits; relate FP to array size and hash count; estimate memory with back-of-envelope.

Interview answer shape

State the problem: exact set membership is too expensive in memory or IO, and false positives are acceptable.
Explain the one-sided contract: definitely not present is final; maybe present must be verified if correctness matters.
Size it from expected cardinality, memory budget, and target false-positive rate.
Discuss lifecycle: rebuild/rotate when overfilled, distribute versions safely, and use counting Bloom or Cuckoo filters if delete is required.
Connect to system impact: fewer disk/store lookups, lower read amplification, but extra positives still hit the authoritative store.

Common follow-ups: why there are no false negatives, how false positives happen, how to choose m and k, delete support, overfilling, hash quality, and Bloom filter vs hash set vs HyperLogLog.

Follow-up answers

Why no false negatives? Insert sets every bit that lookup will later check for that key. If one checked bit is 0, no successful insert for that key happened in this filter version.
How false positives happen: different inserted keys can collectively set all positions for a never-inserted key, so lookup sees all 1s even though the exact set does not contain it.
How to choose m and k: start with expected item count n and acceptable false-positive rate p, compute memory m, then choose k ~= (m/n) * ln(2). Leave headroom or rotate before the filter exceeds planned cardinality.
Why delete is unsafe in classic Bloom filters: clearing a bit for one deleted key may clear a bit needed by another inserted key, creating a false negative.
When not to use it: do not use a Bloom filter as the final authority for permissions, payments, idempotency, or any path where a false positive changes correctness instead of only adding extra verification work.

Interview questions

Why can a Bloom filter have false positives but not false negatives for inserted keys?
Given n expected keys and a 1% false-positive target, how would you estimate memory and hash count?
Why is a Bloom filter unsafe as the only check before dropping a URL, payment id, or permission record?

Memory hooks

not present is final; maybe present needs verification when correctness matters.
Bigger m reduces false positives; too many hashes can saturate the filter.
Bloom filters answer membership, not counts or frequencies.

Back-of-the-envelope estimation
Latency and throughput — fewer authoritative lookups improves read path throughput

On this page