Design an Ad Click Aggregator & Reporting Pipeline
1. Requirements
Functional
- Ingest high-volume ad click and impression events from edge servers and SDKs.
- Filter invalid/fraudulent clicks and deduplicate per policy; attribute clicks to campaign, creative, publisher, geo, and device.
- Serve near-real-time dashboards (minutes lag) and batch-accurate daily billing aggregates.
- Expose APIs for partners to pull settlement reports and reconcile disputes (high level).
- Optional approximate unique user counts for massive audiences (HyperLogLog-class sketches).
Non-Functional
- Scale: millions to billions of events/hour at large networks; write-heavy append stream.
- Latency: streaming path sub-minute rollups for ops; exact finance numbers T+1 or hourly batches.
- Availability: 99.99% ingest; reporting may degrade to stale slices under failure.
- Consistency: eventual for real-time tiles; strong for the money-bound ledger handoff downstream.
- Durability: immutable raw event log; compliance retention years.
Out of Scope
- Full fraud ML models and chargeback legal process.
- Real-time bidding exchange—only post-click aggregation assumed.
- Creative asset hosting.
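The optional approximate unique-user requirement can be met with a HyperLogLog-class sketch. A minimal sketch in Python, assuming 2^10 registers and SHA-1 hashing; class name and parameters are illustrative, not a production implementation:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: m = 2^p registers, each holding the maximum
    leading-zero rank observed among hashes routed to it."""
    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)                     # low p bits pick a register
        w = h >> self.p                            # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1  # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:          # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With m = 1024 registers the standard error is about 1.04/√m ≈ 3.3%, at roughly 1 KB of state per (campaign, day) key, which is why sketches beat exact distinct counts at this scale.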
2. Back-of-Envelope Estimations
- Assume 10B events/day (clicks + impressions mixed), ~500 bytes each as JSON or protobuf → ~5 TB/day raw upstream before compression; Kafka compression typically buys 3–5×.
- Peak QPS: 10B / 86,400 ≈ 116k events/s average; global peaks of 500k–1M events/s during major campaigns.
- Storage: raw data on the order of PB/year; aggregates are far smaller (hourly cubes in the TBs).
- Cardinality: billions of keys across dimensions; unique counts need sketches or sampling.
- Serving: dashboard queries hit OLAP at 10²–10⁴ QPS internally.
- Cache: hot campaigns follow an 80/20 split; size Redis at 10–100 GB per region for latest-hour tiles.
- Dimension explosion: a naive cube over campaign × geo × device × hour already spans billions of cells after one week of IDs; use restricted drill-down contracts in the UI and rollup pruning so engineers do not accidentally GROUP BY high-cardinality IDs at serve time.
- Correction volume: expect 0.1–2% of events rewritten after fraud review; pipelines must apply signed deltas idempotently so net totals stay stable under replay.
- Late events: mobile clicks may arrive minutes late; watermarked streams must update hourly cubes or emit correction deltas to avoid silent under-reporting.
- Finance alignment: billing teams often need T+1 exact totals while ops wants minute bars; publish separate SLAs for streaming dashboards versus authoritative extracts to prevent endless reconciliation debates.
- Skewed campaigns: a handful of advertisers can contribute most revenue; monitor per-campaign shard hotspots in OLAP storage and isolate noisy neighbors into dedicated clusters when the upgrade budget allows.
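A quick sanity check of the headline numbers above (the inputs are the stated assumptions, not measurements):

```python
# Assumed inputs from the estimation above.
events_per_day = 10_000_000_000          # 10B clicks + impressions
bytes_per_event = 500

raw_tb_per_day = events_per_day * bytes_per_event / 1e12
avg_events_per_sec = events_per_day / 86_400
raw_pb_per_year = raw_tb_per_day * 365 / 1_000

print(f"{raw_tb_per_day:.1f} TB/day raw")         # 5.0 TB/day
print(f"{avg_events_per_sec:,.0f} events/s avg")  # ~116k events/s
print(f"{raw_pb_per_year:.2f} PB/year raw")       # ~1.8 PB/year before compression
```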
3. API Design
POST /v1/events
Body: { eventType, campaignId, creativeId, publisherId, timestamp, clickId, ipHash, userPseudoId, bidRequestId }
-> 202 Accepted
GET /v1/reports/campaigns/{campaignId}/hourly?from=&to=
-> 200 { buckets: [{ hour, impressions, clicks, spend }] }
GET /v1/reports/reconcile/{date}/files
-> 200 { files: [{ url, checksum, rowCount }] }
Errors: 400 malformed, 401 bad signature, 409 duplicate clickId (ignored idempotently).
GET /v1/metrics/campaigns/{campaignId}/latency
-> 200 { ingestP95Ms, pipelineLagSeconds }
4. Data Model
- RawEvent: append-only in object storage + Kafka topic retention window.
- DedupeKey: clickId or (userPseudoId, creativeId, secondBucket), policy-dependent; kept in Redis / RocksDB with a TTL window (e.g., a 7-day duplicate-click window).
- AggregateCube: dimensions (campaignId, hour, country, deviceType) → metrics { impressions, clicks, costMicros } in ClickHouse/Druid/Pinot.
- Columnar OLAP for serving; a stream processor performs incremental upserts; shard by campaignId for query locality.
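A minimal in-memory sketch of the DedupeKey window; in production this would be a Redis `SET` with `NX` and a TTL, or RocksDB state, so the class and names here are illustrative stand-ins:

```python
class DedupeWindow:
    """In-memory stand-in for the Redis/RocksDB dedupe store: the first write
    of a key within the TTL window wins; later writes are duplicates."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen: dict[str, float] = {}   # clickId -> first-seen timestamp

    def first_seen(self, click_id: str, now: float) -> bool:
        first = self.seen.get(click_id)
        if first is not None and now - first < self.ttl:
            return False                    # duplicate inside the window
        self.seen[click_id] = now           # new key, or expired window reopened
        return True
```

Collectors drop the event (or answer 409) whenever `first_seen` returns False, which is what makes POST /v1/events safe to retry.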
5. High-Level Architecture
Two paths: a streaming path for ops dashboards (approximate, or latency-bounded exact over small windows), and a batch Iceberg/BigQuery rebuild for authoritative numbers. Kafka serves as the central log.
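The two-path split shows up at serve time as a merge: batch cubes are authoritative wherever they exist, and streaming tiles fill the most recent hours. A toy illustration, with dicts standing in for the Iceberg/BigQuery and ClickHouse stores:

```python
def serve_hour(hour: str, streaming_tiles: dict, batch_cubes: dict):
    """Serve one hourly bucket. Batch output wins when present (authoritative,
    T+1); otherwise fall back to the sub-minute streaming rollup."""
    if hour in batch_cubes:
        return {**batch_cubes[hour], "source": "batch"}
    if hour in streaming_tiles:
        return {**streaming_tiles[hour], "source": "streaming"}
    return None
```

Tagging each response with its source is what lets dashboards honor the separate SLAs for ops tiles versus finance extracts.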
6. Component Deep-Dives
- Ingest: HTTP/gRPC collectors at the edge push to regional Kafka; the idempotency key clickId is stored with a TTL to drop duplicates.
- Enrichment: join geoIP and device-graph side tables via async KV lookups; late data is routed to correction jobs.
- Aggregation: tumbling windows in Flink; watermarks handle skew; dedupe state is RocksDB-backed.
- OLAP: rollups are materialized; drill-downs hit a raw sample for investigations only (cost guard).
- Failure handling: at-least-once Kafka delivery into idempotent sinks; replay from offsets; late events adjust delta tables.
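The at-least-once plus idempotent-sink pattern from the failure bullet, sketched with an in-memory table; a real sink would be something like a ClickHouse ReplacingMergeTree keyed the same way, so the class and names are illustrative:

```python
class IdempotentSink:
    """At-least-once-safe sink: rows are keyed by (aggregate_key, source
    offset), so replaying the same Kafka offset overwrites instead of
    double-counting."""
    def __init__(self):
        self.rows = {}   # (key, offset) -> metrics

    def upsert(self, key: str, offset: int, **metrics) -> None:
        self.rows[(key, offset)] = metrics

    def totals(self, key: str) -> dict:
        out = {}
        for (k, _offset), m in self.rows.items():
            if k == key:
                for name, v in m.items():
                    out[name] = out.get(name, 0) + v
        return out
```

Replaying a consumer from an old offset re-issues the same upserts, and totals stay unchanged, which is exactly the property the replay-from-offset recovery path relies on.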
7. Bottlenecks & Mitigations
- Hot campaigns: shard aggregation state by campaignId; limit the cardinality of group-bys in the UI.
- Skewed keys: salt keys in intermediate map stages; two-phase reduce.
- Dashboard thundering herd: precompute top-N campaigns; coalesce requests in the API layer.
- Backfill storms: throttle reprocessing; use a priority queue for revenue-impacting jobs.
- Privacy: truncate IPs to /24 in cold storage; rotate userPseudoId per policy; aggregation must not re-identify users via small multi-dimensional drill-downs.
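The key-salting mitigation in miniature: phase 1 spreads a hot campaignId across several salted partial counters (separate reducers in a real Flink/Spark shuffle); phase 2 strips the salt and merges the handful of partials per key. `SALTS` and the function name are illustrative:

```python
import random
from collections import Counter

SALTS = 8   # fan-out factor for hot keys

def two_phase_count(campaign_ids):
    """Two-phase reduce with key salting: totals are exact, but no single
    reducer ever owns all traffic for one hot campaign."""
    phase1 = Counter()
    for campaign_id in campaign_ids:
        salt = random.randrange(SALTS)       # hot key -> up to 8 parallel counters
        phase1[(campaign_id, salt)] += 1
    phase2 = Counter()
    for (campaign_id, _salt), n in phase1.items():
        phase2[campaign_id] += n             # cheap: at most SALTS rows per key
    return phase2
```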
8. Tradeoffs
| Decision | Alternative | Why we picked |
|---|---|---|
| Kafka log backbone | Direct DB writes | Spike isolation and replay |
| Approximate uniques | Exact distinct always | Cost at web scale |
| Lambda + Kappa variants | Pure batch | Ops latency needs |
| Columnar OLAP | Row SQL | Analytics slice performance |
9. Follow-ups (interviewer drill-downs)
- 100× event volume? Partition Kafka further; pre-aggregate at the edge where loss is acceptable; add regional hubs.
- Exactly-once aggregates? Micro-batch transactions per partition offset plus idempotent upserts; still define what idempotency means at the business level.
- Migration? Dual-write to both OLAP stores; compare checksum reports per campaign-day.
- Multi-region? Ingest locally; mirror Kafka globally, or merge in the batch layer with clock-skew rules.
- Cost? Tiered storage; sample raw data for analytics; GPUs only if ML fraud detection is in scope (it is not here).
- Attribution windows? Last-click vs multi-touch models change reported ROI; persist the raw touch sequence so finance can replay offline when the rules change.
- GDPR / opt-out? Honored device flags suppress logging downstream; aggregates must still converge when users delete accounts, via negative adjustment events in ledger-style fact tables.
- Clock skew? Mobile devices backdate clicks; store both event time and ingest time so windows can apply business rules consistently between ops and finance.
- Sampling for exploration? Product may accept 1% sampling of deep dimensions for debugging; clearly label such dashboards approximate to avoid executive misreads.
- Encryption at rest? Kafka and object-store buckets hold PII hashes; CMK rotation playbooks are required before auditors show up, not after.
- Replay attacks? Bad actors resubmit old events; include signed session nonces from the collector so dedupe keys cannot be forged across days.
- Vendor egress fees? Moving raw logs between clouds is expensive; compress aggressively and aggregate early when multi-cloud DR is policy.
- Schema drift? Mobile SDKs ship weekly; version events, and reject unknown required fields in prod only after shadow parsing in canary regions.
- Union of streams? Impression and click topics must join on stable IDs; if the ad server resets creatives mid-flight, backfill mapping tables so historical joins stay true.
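For the replay-attack follow-up, a sketch of day-scoped signed dedupe keys; the secret, key layout, and tag length here are all illustrative choices:

```python
import hashlib
import hmac

SECRET = b"collector-signing-key"   # hypothetical per-collector secret

def signed_dedupe_key(click_id: str, day: str) -> str:
    """Bind the dedupe key to a day and a collector secret, so an event
    replayed on another day cannot forge or collide into today's key."""
    mac = hmac.new(SECRET, f"{click_id}|{day}".encode(), hashlib.sha256)
    return f"{click_id}:{day}:{mac.hexdigest()[:16]}"

def verify(key: str) -> bool:
    click_id, day, tag = key.rsplit(":", 2)
    expect = hmac.new(SECRET, f"{click_id}|{day}".encode(),
                      hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(tag, expect)
```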