THN Interview Prep

Design an Ad Click Aggregator & Reporting Pipeline

1. Requirements

Functional

  • Ingest high-volume ad click and impression events from edge servers and SDKs.
  • Deduplicate clicks and filter invalid/fraudulent ones per policy; attribute each click to campaign, creative, publisher, geo, and device.
  • Serve near-real-time dashboards (minutes lag) and batch-accurate daily billing aggregates.
  • Expose APIs for partners to pull settlement reports and reconcile disputes (high level).
  • Optional approximate unique user counts for massive audiences (HyperLogLog-class sketches).

Non-Functional

  • Scale: millions to billions of events/hour at large networks; write-heavy append stream.
  • Latency: streaming path sub-minute rollups for ops; exact finance numbers T+1 or hourly batches.
  • Availability: 99.99% ingest; reporting may degrade to stale slices under failure.
  • Consistency: eventual for real-time tiles; strong for money-bound ledger handoff downstream (consistency).
  • Durability: immutable raw event log; compliance retention years.

Out of Scope

  • Full fraud ML models and chargeback legal process.
  • Real-time bidding exchange—only post-click aggregation assumed.
  • Creative asset hosting.

2. Back-of-Envelope Estimations

Assume 10B events/day (clicks + impressions mixed), ~500 bytes each as JSON or protobuf → ~5 TB/day raw upstream before compression; Kafka compression ~3–5×. A quick arithmetic sketch follows the bullet list below.

  • QPS: 10B / 86,400 ≈ 116k events/s average; peaks of 500k–1M events/s globally during big campaigns.

  • Storage: raw events on the order of 1–2 PB/year before compression; aggregates far smaller (hourly cubes in TBs).

  • Cardinality: billions of keys across dimensions; uniques need sketches or sampling (see the HyperLogLog sketch at the end of this section).

  • Serving: dashboard queries hit OLAP at 10²–10⁴ QPS internally.

  • Cache: hot campaigns follow an 80/20 pattern; Redis holds 10–100 GB per region for latest-hour tiles.

  • Dimension explosion: a naive cube over campaign × geo × device × hour already spans billions of cells after one week; use restricted drill-down contracts for the UI plus rollup pruning so engineers do not accidentally GROUP BY high-cardinality IDs at serve time.

  • Correction volume: expect 0.1–2% of events to be rewritten after fraud review; pipelines must apply signed deltas idempotently so net totals stay stable under replay (idempotency).
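
A quick arithmetic sketch of the estimates above (Python). The per-event size and compression ratio are the assumptions already stated; the peak burst multiplier is an extra assumption for illustration:

# Back-of-envelope sketch: 10B events/day at ~500 bytes each, ~4x Kafka compression.
EVENTS_PER_DAY = 10_000_000_000
BYTES_PER_EVENT = 500
KAFKA_COMPRESSION = 4          # assumed midpoint of the 3-5x range
PEAK_FACTOR = 6                # assumed burst multiplier during big campaigns

raw_per_day_tb = EVENTS_PER_DAY * BYTES_PER_EVENT / 1e12       # ~5 TB/day raw
compressed_per_day_tb = raw_per_day_tb / KAFKA_COMPRESSION     # ~1.25 TB/day on the log
raw_per_year_pb = raw_per_day_tb * 365 / 1000                  # ~1.8 PB/year uncompressed
avg_eps = EVENTS_PER_DAY / 86_400                              # ~116k events/s average
peak_eps = avg_eps * PEAK_FACTOR                               # ~700k events/s assumed peak

print(f"raw/day {raw_per_day_tb:.1f} TB, raw/year {raw_per_year_pb:.1f} PB")
print(f"avg {avg_eps:,.0f} ev/s, assumed peak {peak_eps:,.0f} ev/s")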

Late events: Mobile clicks may arrive minutes late—watermarked streams must update hourly cubes or emit correction deltas to avoid silent under-reporting.

Finance alignment: Billing teams often need T+1 exact totals while ops wants minute bars—publish separate SLAs for streaming dashboards versus authoritative extracts to prevent endless reconciliation debates (consistency).

Skewed campaigns: A handful of advertisers can contribute most revenue—monitor per-campaign shard hotspots in OLAP storage and isolate noisy neighbors into dedicated clusters when upgrade budget allows.
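
For the cardinality bullet above, a minimal HyperLogLog-style sketch in pure Python, illustrative only; in practice you would use the OLAP engine's built-in HLL or a datasketches library. The register count (p=14) and correction constants follow the standard formulation, and the demo inputs are made up:

import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate distinct counts (illustrative)."""

    def __init__(self, p: int = 14):
        self.p = p                      # 2^p registers; p=14 gives roughly 0.8% relative error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash: the first p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        w = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - w.bit_length() + 1 if w else (64 - self.p) + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:      # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(f"user-{i}")
print(hll.count())   # within a percent or two of 1,000,000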

3. API Design

POST /v1/events
Body: { eventType, campaignId, creativeId, publisherId, timestamp, clickId, ipHash, userPseudoId, bidRequestId }
-> 202 Accepted

GET /v1/reports/campaigns/{campaignId}/hourly?from=&to=
-> 200 { buckets: [{ hour, impressions, clicks, spend }] }

GET /v1/reports/reconcile/{date}/files
-> 200 { files: [{ url, checksum, rowCount }] }

Errors: 400 malformed payload, 401 bad signature, 409 duplicate clickId ignored idempotently (idempotency).

GET /v1/metrics/campaigns/{campaignId}/latency
-> 200 { ingestP95Ms, pipelineLagSeconds }
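
A sketch of a collector client for POST /v1/events that treats 202 and 409 as equivalent success so retries stay idempotent. The host, timeout, and field values are assumptions and request signing is omitted; only the path and body shape come from the spec above:

import time
import uuid
import requests  # assumed HTTP client; any client works

COLLECTOR = "https://collector.example.com"   # hypothetical host

def send_click(campaign_id: str, creative_id: str, publisher_id: str) -> bool:
    event = {
        "eventType": "click",
        "campaignId": campaign_id,
        "creativeId": creative_id,
        "publisherId": publisher_id,
        "timestamp": int(time.time() * 1000),   # client event time, epoch millis (assumed unit)
        "clickId": str(uuid.uuid4()),            # idempotency key: safe to resend the same body
        "ipHash": "ab12...",                     # hashed upstream, never the raw IP
        "userPseudoId": "pseudo-123",
        "bidRequestId": "bid-456",
    }
    resp = requests.post(f"{COLLECTOR}/v1/events", json=event, timeout=2)
    # 202 = accepted, 409 = duplicate clickId already ingested; both mean "done".
    if resp.status_code in (202, 409):
        return True
    if resp.status_code in (400, 401):
        raise ValueError(f"rejected: {resp.status_code} {resp.text}")
    return False  # retry later (e.g., 5xx); the fixed clickId keeps the retry idempotent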

4. Data Model

  • RawEvent: append-only log in object storage plus a Kafka topic with a retention window.
  • DedupeKey: clickId or (userPseudoId, creativeId, secondBucket), policy-dependent; Redis / RocksDB with TTL for windows (e.g., a 7-day duplicate-click window; see the Redis sketch at the end of this section).
  • AggregateCube: dimensions (campaignId, hour, country, deviceType) → metrics { impressions, clicks, costMicros } in ClickHouse/Druid/Pinot.

Columnar OLAP serves queries; a stream processor performs the incremental upserts. Shard by campaignId for query locality.
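
A minimal sketch of the DedupeKey policy using Redis SET NX with a TTL (redis-py). The clickId key and 7-day window come from above; the key naming and connection details are assumptions:

import redis  # redis-py, assumed available with a local Redis running

r = redis.Redis(host="localhost", port=6379)
DEDUPE_TTL_SECONDS = 7 * 24 * 3600   # 7-day duplicate-click window per the policy above

def is_first_occurrence(click_id: str) -> bool:
    # SET key NX EX ttl succeeds only for the first writer within the window,
    # so retries, replays, and resubmitted fraud events are dropped idempotently.
    return bool(r.set(f"dedupe:click:{click_id}", 1, nx=True, ex=DEDUPE_TTL_SECONDS))

def handle_event(event: dict) -> bool:
    """Returns True if the event was forwarded, False if dropped as a duplicate."""
    if not is_first_occurrence(event["clickId"]):
        return False                 # duplicate: ack to the producer, but drop
    # forward to enrichment / aggregation (omitted)
    return True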

5. High-Level Architecture

Two paths: a streaming path for ops dashboards (approximate, or exact over small latency-bounded windows) and a batch rebuild in Iceberg/BigQuery for authoritative numbers. Kafka is the central log (message queues).

6. Component Deep-Dives

  • Ingest: HTTP/gRPC collectors at edge push to regional Kafka; idempotency key clickId stored with TTL to drop duplicates.
  • Enrichment: Join geoIP, device graph side tables via async KV; late data routed to correction jobs.
  • Aggregation: Tumbling windows in Flink; watermarks handle skew; dedupe state is RocksDB-backed (a minimal windowing sketch follows this list).
  • OLAP: Rollups materialized; drill-down hits raw sample for investigations only (cost guard).
  • Failure: At-least-once Kafka → idempotent sinks; replay from offset; late events adjust delta tables.
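
A single-process sketch of tumbling-window aggregation with a watermark and late-event correction deltas; it mirrors what the Flink job does conceptually but is not Flink code, and the skew allowance, allowed lateness, and emit format are assumptions:

from collections import defaultdict

WINDOW_SECONDS = 3600            # hourly tumbling windows
ALLOWED_LATENESS = 15 * 60       # assumed: accept events up to 15 min behind the watermark
SKEW_ALLOWANCE = 60              # assumed: watermark trails max event time by 60s

class HourlyAggregator:
    def __init__(self):
        self.windows = defaultdict(lambda: {"impressions": 0, "clicks": 0})
        self.watermark = 0

    def process(self, event, emit):
        event_time = event["timestamp"]                      # event time, seconds
        self.watermark = max(self.watermark, event_time - SKEW_ALLOWANCE)
        window_start = event_time - event_time % WINDOW_SECONDS
        key = (event["campaignId"], window_start)
        metric = event["eventType"] + "s"                    # "clicks" or "impressions"

        if window_start + WINDOW_SECONDS + ALLOWED_LATENESS < self.watermark:
            # Too late for the open window: emit a correction delta so the
            # downstream cube can be patched instead of silently under-reporting.
            emit({"type": "correction", "key": key, "metric": metric, "delta": 1})
            return
        self.windows[key][metric] += 1

    def flush_closed(self, emit):
        # Emit and drop windows whose end plus allowed lateness is behind the watermark.
        for key in list(self.windows):
            _, window_start = key
            if window_start + WINDOW_SECONDS + ALLOWED_LATENESS < self.watermark:
                emit({"type": "rollup", "key": key, **self.windows.pop(key)})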

7. Bottlenecks & Mitigations

  • Hot campaigns: Shard aggregation state by campaignId; limit cardinality of group-bys in UI.

  • Skewed keys: Salting in intermediate map stages; two-phase reduce (see the salting sketch after this list).

  • Dashboard thundering herd: Precompute top-N campaigns; request coalescing in API.

  • Backfill storms: Throttle reprocessing; priority queue for revenue-impacting jobs.

  • Privacy: IP truncated to /24 in cold storage; userPseudoId rotates per policy—aggregation must not re-identify via small multi-dimensional drill-downs.
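
A sketch of the salting idea from the "Skewed keys" bullet: phase one spreads a hot campaign across salted partial aggregators, phase two strips the salt and merges, so no single reducer key stays hot. The salt fan-out and Counter-based aggregation are assumptions for illustration:

import random
from collections import Counter

NUM_SALTS = 16   # assumed fan-out for hot keys

def phase_one_key(campaign_id: str) -> str:
    # Spread a hot campaign across NUM_SALTS partial aggregators.
    return f"{campaign_id}#{random.randrange(NUM_SALTS)}"

def phase_one(events) -> Counter:
    partials = Counter()
    for e in events:
        partials[phase_one_key(e["campaignId"])] += 1
    return partials

def phase_two(partials: Counter) -> Counter:
    # Strip the salt and merge: the final reducer sees at most NUM_SALTS rows
    # per campaign instead of every raw event.
    totals = Counter()
    for salted_key, count in partials.items():
        campaign_id, _, _salt = salted_key.rpartition("#")
        totals[campaign_id] += count
    return totals

events = [{"campaignId": "c1"}] * 1000 + [{"campaignId": "c2"}] * 10
print(phase_two(phase_one(events)))   # Counter({'c1': 1000, 'c2': 10})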

8. Tradeoffs

Decision | Alternative | Why we picked it
Kafka log backbone | Direct DB writes | Spike isolation and replay
Approximate uniques | Exact distinct always | Cost at web scale
Lambda + Kappa variants | Pure batch | Ops latency needs
Columnar OLAP | Row SQL | Analytics slice performance

9. Follow-ups (interviewer drill-downs)

  • 100× event volume? Partition Kafka; edge pre-aggregate where loss acceptable; regional hubs.

  • Exactly-once aggregates? Micro-batch transactions keyed per partition offset plus idempotent upserts; still define business-level idempotency (idempotency). A minimal upsert sketch follows this list.

  • Migration? Dual-write OLAP stores; compare checksum reports per campaign-day.

  • Multi-region? Ingest local, global Kafka mirror or merge in batch layer with clock skew rules.

  • Cost? Tiered storage; sample raw for analytics; GPU only if ML fraud—not here.

  • Attribution windows? Last-click vs multi-touch models change reported ROI; the pipeline should persist the raw touch sequence for offline replays when finance changes the rules.

  • GDPR / opt-out? Honored device flags suppress logging downstream; aggregates must still converge when users delete accounts, via negative adjustment events in ledger-style fact tables.

  • Clock skew? Mobile devices backdate clicks—event time vs ingest time must both be stored so windows can choose business rules consistently (consistency between ops and finance).

  • Sampling for exploration? Product may accept 1% sampled deep dimensions for debug—clearly label dashboards approximate to avoid executive misreads.

  • Encryption at rest? Kafka and object store buckets hold PII hashes; CMK rotation playbooks are required before auditors show up, not after.

  • Replay attacks? Bad actors resubmit old events—include signed session nonces from collector so dedupe keys cannot be forged across days.

  • Vendor egress fees? Moving raw logs between clouds is expensive; compress aggressively and aggregate early when multi-cloud DR is policy.

  • Schema drift? Mobile SDKs ship weekly—version events and reject unknown required fields in prod only after shadow parse in canary regions (pub/sub replay tests).

  • Union of streams? Impression and click topics must join on stable IDs; if the ad server resets creative IDs mid-flight, backfill mapping tables so historical joins stay true.
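
For the exactly-once follow-up above, a minimal sketch of an idempotent sink: each micro-batch's aggregate deltas and its (topic, partition, offset) high-water mark commit in one transaction, so replaying already-applied offsets is a no-op. SQLite stands in for the real OLAP store; table names and dialect are assumptions:

import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; transactions managed explicitly
con.executescript("""
CREATE TABLE sink_offsets (topic TEXT, kafka_partition INTEGER, last_offset INTEGER,
                           PRIMARY KEY (topic, kafka_partition));
CREATE TABLE hourly_cube (campaign_id TEXT, hour INTEGER, clicks INTEGER,
                          PRIMARY KEY (campaign_id, hour));
""")

def apply_micro_batch(topic, partition, batch_end_offset, deltas):
    """Apply aggregate deltas and the consumer offset in one transaction;
    replaying a batch whose offsets were already committed changes nothing."""
    cur = con.cursor()
    cur.execute("BEGIN")
    row = cur.execute("SELECT last_offset FROM sink_offsets WHERE topic=? AND kafka_partition=?",
                      (topic, partition)).fetchone()
    if row and row[0] >= batch_end_offset:
        cur.execute("ROLLBACK")                 # already applied: idempotent no-op
        return
    for (campaign_id, hour), clicks in deltas.items():
        cur.execute("""INSERT INTO hourly_cube (campaign_id, hour, clicks) VALUES (?, ?, ?)
                       ON CONFLICT (campaign_id, hour) DO UPDATE SET clicks = clicks + excluded.clicks""",
                    (campaign_id, hour, clicks))
    cur.execute("""INSERT INTO sink_offsets (topic, kafka_partition, last_offset) VALUES (?, ?, ?)
                   ON CONFLICT (topic, kafka_partition) DO UPDATE SET last_offset = excluded.last_offset""",
                (topic, partition, batch_end_offset))
    cur.execute("COMMIT")

apply_micro_batch("clicks", 0, 1000, {("cmp-1", 2024010112): 42})
apply_micro_batch("clicks", 0, 1000, {("cmp-1", 2024010112): 42})   # replay of same offsets: ignored
print(con.execute("SELECT * FROM hourly_cube").fetchall())           # [('cmp-1', 2024010112, 42)]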
