Bloom Filter
What it is
A Bloom filter is a probabilistic set membership structure: a bit array and several hash functions map each element to bit positions. If all those bits are set, the element is maybe in the set; if any bit is clear, the element is definitely not in the set.
False positive rate
- No false negatives (assuming insert-only use and a correct implementation): if the filter says “not present,” the key was never inserted.
- False positives are possible: the filter may say “maybe present” for a key that was never inserted. The false positive rate falls as the bit array grows and as the number of hash functions is tuned to the expected element count.
With n expected elements, m bits, and k hash functions, the false positive rate is approximately (1 − e^(−kn/m))^k; in practice, engineers often rely on online calculators or library defaults rather than computing it by hand.
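The approximation can be checked directly; this is a minimal sketch (the parameter values are illustrative, not from any particular system):

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate FP rate for n inserted keys, m bits, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Example: 1M keys in a 10M-bit array with 7 hashes.
# k = 7 is near-optimal for m/n = 10, since k* = (m/n) * ln 2 ≈ 6.9.
rate = false_positive_rate(1_000_000, 10_000_000, 7)  # about 0.8%
```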
Use cases
- Pre-check before an expensive lookup: avoiding disk/SSD reads for absent keys, a caching layer answering “maybe cached,” Cassandra and HBase block filters to skip IO.
- Deduplication at huge scale with bounded memory: distributed joins, or a crawler’s “seen URL” check.
- One-sided cheap negative answers to save work.
Classic Bloom filter: no delete
Standard Bloom filters do not support reliable delete without risking false negatives. Counting Bloom filters use counters per slot to allow delete at higher memory cost. Cuckoo filters are an alternative with delete support and good space efficiency in some workloads.
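To see why naive delete breaks a classic filter, consider two keys whose hash positions overlap; the bit positions below are hypothetical, chosen only to force a collision:

```python
# Illustration: clearing bits to "delete" a key causes false negatives.
bits = [0] * 8

cat_positions = [1, 4]   # hypothetical hash positions for "cat"
dog_positions = [4, 6]   # hypothetical positions for "dog" (shares bit 4)

for p in cat_positions + dog_positions:
    bits[p] = 1

# Naive delete of "cat": clear its bits, including the shared bit 4.
for p in cat_positions:
    bits[p] = 0

# "dog" now looks absent even though it was never removed:
# a false negative, which violates the filter's core guarantee.
dog_maybe_present = all(bits[p] for p in dog_positions)
```

A counting filter avoids this by decrementing per-slot counters instead of clearing bits, so the shared slot stays nonzero as long as any contributing key remains.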
insert(key): for each hash: bit[hash(key, index)] = 1
maybePresent(key): all those bits are 1 -> maybe
                   any bit 0 -> definitely no
When to use
- Huge universe of keys; acceptable occasional extra positives; need tiny memory vs exact set.
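The pseudocode above can be fleshed out into a minimal sketch; deriving k positions by salting one hash is an assumption here (real libraries often use double hashing instead):

```python
import hashlib

class BloomFilter:
    """Minimal classic Bloom filter (illustrative, insert-only)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, key: str):
        # k positions from one keyed hash, salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(key.encode(),
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def insert(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_present(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

An inserted key always reports “maybe present”; an empty filter reports “definitely not present” for everything, since every bit is still zero.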
Alternatives
- Hash set in memory: exact, but memory grows with every key.
- Sketches (Count-Min, HyperLogLog) for different questions (frequency, cardinality).
Failure modes
- Overfilled filter (more keys than the bit array was sized for): the false positive rate climbs toward 1 and the filter stops saving work.
- Assuming no FP in correctness-critical paths without verification step.
- Delete on classic filter breaks correctness.
Interview talking points
- Always pair Bloom pre-check with authoritative store lookup on positives.
- Explain hashes and bits; relate FP to array size and hash count; estimate memory with back-of-envelope.
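The back-of-envelope sizing follows from the standard formulas m = −n·ln(p)/(ln 2)² and k = (m/n)·ln 2; the 100-million-key example below is illustrative:

```python
import math

def size_bloom_filter(n: int, target_fp: float):
    """Bits m and hash count k for n keys at a target false positive rate."""
    m = math.ceil(-n * math.log(target_fp) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# 100 million keys at 1% FP: about 9.6 bits per key, roughly 120 MB, k = 7.
m, k = size_bloom_filter(100_000_000, 0.01)
megabytes = m / 8 / 1_000_000
```

Note the memory is per key, independent of key length, which is why the filter stays tiny next to an exact hash set of the same keys.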
Related fundamentals
- Back-of-the-envelope estimation
- Latency and throughput: fewer authoritative lookups improve read-path throughput