THN Interview Prep

Design Notification System

1. Requirements

Functional

  • Deliver user-targeted notifications across channels: mobile push (FCM/APNs), email (SMTP provider), SMS (Twilio-class), and optional in-app inbox.
  • Support transactional triggers (password reset) and product events (new follower); optional scheduled campaigns with segmentation.
  • Per-user channel preferences, quiet hours, frequency caps, and opt-in/out with legal compliance (unsubscribe).
  • At-least-once delivery to external providers; dedupe visible duplicates within a time window.
  • Track delivery state: queued, sent, delivered, opened (where channel supports), failed with reason.

Non-Functional

  • Scale: 50M users; peak 5k notification fan-out requests/s into the system; downstream providers absorb the millions of device deliveries with their own scaling.
  • Latency: enqueue p99 under 100 ms; push reaches the device within seconds for the real-time class; batch campaigns may be deferred by minutes.
  • Availability: 99.95% for ingestion API; downstream providers have separate SLAs.
  • Consistency: preference updates are strongly consistent per user; delivery logs are eventually consistent for analytics.
  • Durability: no silent drop—every accepted notification must be persisted until acknowledged or dead-lettered.

Out of Scope

  • Full marketing automation visual builder.
  • On-device notification rendering and rich media composition beyond payload templates.
  • In-house SMTP infrastructure (use SES/SendGrid).
  • End-to-end encryption of notification body for push (payload visible to Google/Apple).

2. Back-of-Envelope Estimations

Assume 50M MAU with roughly 2 notification-worthy events per user per month ~ 100M/month ~ 23M/week ~ 3M/day average.

  • Ingest: 3M / 86,400 ~ 35/s average; campaign spikes 500/s; major incidents 5k/s enqueue.

  • Fan-out: one event may target 1 user or 1M (broadcast). Segmented campaigns use precomputed audience batches—avoid naive O(N) joins at send time.

  • Storage: 3M rows/day * 2 KB metadata ~ 6 GB/day; 90-day retention ~ 550 GB in cold store + hot index.

  • Push provider egress: mostly outbound HTTPS to FCM/APNs; bandwidth dominated by JSON payloads—negligible vs image-heavy systems.

  • Queue depth: at 5k/s sustained for 60s = 300k messages; size ~ 600 MB if 2 KB each—Kafka partition planning needs headroom.
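A quick arithmetic check of the figures above, as a sketch; the constants are just the assumptions stated in this section.

# Sanity-check the back-of-envelope numbers (pure arithmetic, no I/O).
DAILY_NOTIFICATIONS = 3_000_000        # ~3M/day average from the assumption above
ROW_BYTES = 2 * 1024                   # ~2 KB metadata per notification
RETENTION_DAYS = 90

avg_ingest_per_sec = DAILY_NOTIFICATIONS / 86_400          # ~35/s
daily_storage_gb = DAILY_NOTIFICATIONS * ROW_BYTES / 1e9   # ~6 GB/day
retention_gb = daily_storage_gb * RETENTION_DAYS           # ~550 GB

burst_messages = 5_000 * 60                                # 5k/s sustained for 60s -> 300k
burst_mb = burst_messages * ROW_BYTES / 1e6                # ~600 MB of queue backlog

print(f"ingest ~{avg_ingest_per_sec:.0f}/s, storage ~{daily_storage_gb:.1f} GB/day "
      f"(~{retention_gb:.0f} GB over {RETENTION_DAYS} days), "
      f"burst backlog {burst_messages} msgs ~{burst_mb:.0f} MB")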


3. API Design

POST /v1/notifications
Authorization: Bearer <service-token>
Body: {
  "templateId": "password_reset",
  "userId": "u_123",
  "channels": ["push", "email"],
  "variables": { "resetLink": "https://..." },
  "priority": "high",
  "dedupeKey": "pwd-reset-u_123-20260429"
}
-> 202 { "notificationId": "n_abc", "status": "queued" }
GET /v1/users/{userId}/notification-preferences
-> 200 { "push": true, "email": true, "sms": false, "quietHours": { "start": "22:00", "end": "07:00", "tz": "America/Los_Angeles" } }
PATCH /v1/users/{userId}/notification-preferences
Body: { "push": false }
-> 204
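The quietHours object above spans midnight, which is easy to get wrong; here is a minimal evaluation sketch assuming Python's zoneinfo. The function name and fallback behavior are illustrative, not part of the API.

# Hedged sketch: evaluate the quietHours object from the preferences response.
from datetime import datetime, time
from zoneinfo import ZoneInfo

def in_quiet_hours(quiet: dict, now_utc: datetime) -> bool:
    tz = ZoneInfo(quiet["tz"])
    local = now_utc.astimezone(tz).time()
    start = time.fromisoformat(quiet["start"])   # "22:00"
    end = time.fromisoformat(quiet["end"])       # "07:00"
    if start <= end:                             # same-day window, e.g. 09:00-17:00
        return start <= local < end
    return local >= start or local < end         # window spans midnight, e.g. 22:00-07:00

# Example with the payload shown above:
prefs = {"start": "22:00", "end": "07:00", "tz": "America/Los_Angeles"}
print(in_quiet_hours(prefs, datetime.now(ZoneInfo("UTC"))))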

Worker internal

gRPC RenderAndDispatch(RenderRequest) -> response includes the resolved template version and channel-specific payload bytes.

Webhooks from providers:

POST /v1/webhooks/fcm
Body: { "messageId": "...", "event": "delivered" }

4. Data Model

Entities

  • notification: id, user_id, template_id, channels[], variables JSON, status, created_at, dedupe_key (unique partial index).
  • delivery_attempt: notification_id, channel, provider_message_id, state, error_code, timestamp.
  • user_preferences: user_id PK, flags per channel, quiet hours, locale.
  • device_token: user_id, platform, token, last_seen, invalidated_at — for FCM/APNs mapping.

Database choice

  • PostgreSQL for relational integrity between users and preferences; use JSONB for flexible quiet-hour rules.
  • High-write logs: partition delivery_attempt by month, or sink from Kafka to Cassandra/BigQuery for analytics scale.

Templates

  • Store templates in a Git-backed CMS or as versioned JSON + Handlebars in S3; the render service pulls by templateId and version. Keep templates as DB blobs only if non-developers must edit them.

Indexes

  • (user_id, created_at DESC) for inbox APIs.
  • Unique (dedupe_key) where not null.

Why Kafka vs RabbitMQ for queue

  • Kafka: replay, retention, and multiple consumer groups (render vs analytics). RabbitMQ: classic task queues with lower operational overhead at moderate volume. Pick Kafka when replay and throughput dominate.
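A minimal sketch of the multi-consumer-group point, assuming kafka-python; the topic and group names are illustrative.

# Both consumers read the full 'notifications' topic independently because they use
# different group_ids; since Kafka retains the log, analytics can also replay from offset 0.
from kafka import KafkaConsumer

render_consumer = KafkaConsumer(
    "notifications", group_id="render-workers",
    bootstrap_servers="kafka:9092", enable_auto_commit=False)
analytics_consumer = KafkaConsumer(
    "notifications", group_id="analytics-sink",
    bootstrap_servers="kafka:9092", auto_offset_reset="earliest")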

5. High-Level Architecture

High-level flow (diagram not reproduced): client services call the ingestion API, which checks preferences and dedupe, persists the notification, and publishes a compact event to Kafka; render workers resolve templates and locale; channel routers dispatch to FCM/APNs, SES/SendGrid, and SMS providers; provider webhooks feed delivery_attempt tracking; an in-app inbox serves the read path.

6. Component Deep-Dives

Ingestion service

  • Validates auth (service-to-service mTLS or OAuth2 client credentials).
  • Loads preferences; if all channels off, short-circuit with audit log.
  • Enforces dedupe: INSERT ... ON CONFLICT DO NOTHING on dedupe_key; if the row already exists, skip dispatch (sketched after this list).
  • Publishes compact event { notificationId, userId, templateId, version } to Kafka—not full variable payload if PII-heavy (load from secure store in worker).
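A minimal sketch of that ingestion path, assuming psycopg2 and kafka-python; the table, column, and topic names are assumptions, not a prescribed schema.

# Hedged sketch: dedupe via the unique partial index on dedupe_key, then publish a
# compact event keyed by userId so partitioning follows users, not campaigns.
import json
import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect("dbname=notifications")
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

def enqueue(notification_id, user_id, template_id, dedupe_key):
    with conn, conn.cursor() as cur:
        # Matches the unique (dedupe_key) WHERE dedupe_key IS NOT NULL index above.
        cur.execute(
            """INSERT INTO notification (id, user_id, template_id, dedupe_key, status)
               VALUES (%s, %s, %s, %s, 'queued')
               ON CONFLICT (dedupe_key) WHERE dedupe_key IS NOT NULL DO NOTHING""",
            (notification_id, user_id, template_id, dedupe_key))
        if cur.rowcount == 0:      # duplicate within the dedupe window: skip dispatch
            return
    # Compact event only; PII-heavy variables stay in the secure store for the worker.
    producer.send("notifications",
                  key=user_id.encode(),
                  value={"notificationId": notification_id,
                         "userId": user_id,
                         "templateId": template_id})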

Template rendering

  • Handlebars vs Jinja (Python) — pick one stack; Java/Node Handlebars common in polyglot orgs.
  • i18n: resolve locale from user profile; load string packs from Phrase/Lokalise or static bundles in S3.
  • Push payload size limits: APNs ~4 KB; trim and link out to web.
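A minimal render sketch assuming the Jinja2 option with illustrative template names; it resolves locale with an English fallback and trims toward the APNs payload budget.

# Hedged sketch: templates and locale fallback are illustrative, not the doc's schema.
import json
from jinja2 import Environment, DictLoader, TemplateNotFound

APNS_PAYLOAD_LIMIT = 4096  # bytes

env = Environment(loader=DictLoader({
    "password_reset.en": "Reset your password: {{ resetLink }}",
    "password_reset.de": "Passwort zurücksetzen: {{ resetLink }}",
}), autoescape=False)

def render_push(template_id: str, locale: str, variables: dict) -> bytes:
    try:
        body = env.get_template(f"{template_id}.{locale}").render(**variables)
    except TemplateNotFound:
        body = env.get_template(f"{template_id}.en").render(**variables)  # locale fallback
    raw = json.dumps({"aps": {"alert": body}}).encode()
    if len(raw) > APNS_PAYLOAD_LIMIT:
        # Trim the alert text and rely on a deep link for the full content.
        raw = json.dumps({"aps": {"alert": body[:200] + "..."}}).encode()
    return raw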

Channel routers

  • FCM HTTP v1 vs legacy—use v1 with service accounts and per-platform credential rotation.
  • APNs HTTP/2 with JWT auth; maintain connection pools per worker to avoid handshake storms (sketched after this list).
  • SES vs SendGrid: SES cheaper at scale; SendGrid better deliverability UX for small teams—product decision.
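A minimal APNs dispatch sketch assuming httpx (with HTTP/2) and PyJWT for the provider token; the team id, key id, bundle id, and key path are placeholders.

import time
import jwt      # PyJWT; ES256 requires the 'cryptography' package
import httpx    # HTTP/2 requires the 'h2' extra

TEAM_ID, KEY_ID, BUNDLE_ID = "TEAMID1234", "KEYID12345", "com.example.app"
with open("AuthKey.p8") as f:
    PRIVATE_KEY = f.read()

def apns_token() -> str:
    # APNs accepts provider tokens for up to ~1 hour; cache instead of signing per request.
    return jwt.encode({"iss": TEAM_ID, "iat": int(time.time())},
                      PRIVATE_KEY, algorithm="ES256", headers={"kid": KEY_ID})

# One pooled HTTP/2 client per worker avoids the handshake storms mentioned above.
client = httpx.Client(http2=True, base_url="https://api.push.apple.com")

def send_push(device_token: str, payload: dict) -> int:
    resp = client.post(f"/3/device/{device_token}", json=payload,
                       headers={"authorization": f"bearer {apns_token()}",
                                "apns-topic": BUNDLE_ID,
                                "apns-push-type": "alert"})
    return resp.status_code   # 410 => prune the device token (see hygiene below)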

Device token hygiene

  • Prune on 410/404 responses from providers; periodically sweep stale tokens to avoid wasted provider calls.

In-app inbox

  • Separate the read path from push: store rendered summary rows in Postgres or DynamoDB for mobile sync; cache hot user inbox lists in Redis.
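A minimal cache-aside sketch for the hot inbox lists, assuming redis-py; the key format, TTL, and loader callback are illustrative. The same pattern fits the 60 s preference cache in section 7.

import json
import redis

r = redis.Redis(host="redis", port=6379)
INBOX_TTL_SECONDS = 60

def get_inbox(user_id: str, load_from_db) -> list:
    key = f"inbox:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    rows = load_from_db(user_id)                     # hits the (user_id, created_at DESC) index
    r.setex(key, INBOX_TTL_SECONDS, json.dumps(rows))
    return rows

def invalidate_inbox(user_id: str) -> None:
    r.delete(f"inbox:{user_id}")                     # call after writing a new inbox row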

Failure handling

  • Retry with exponential backoff per provider error class; circuit breaker per provider region (see the sketch after this list).
  • Dead-letter queue topic in Kafka for manual replay after fix.
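A minimal sketch of the retry policy: exponential backoff with jitter, then a dead-letter publish for manual replay. The send callback and topic name are illustrative.

import json
import random
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")
MAX_ATTEMPTS = 5

def dispatch_with_retry(send_once, message: dict) -> bool:
    for attempt in range(MAX_ATTEMPTS):
        try:
            send_once(message)       # should raise only on retryable provider error classes
            return True
        except Exception:
            # Full jitter: sleep in [0, 2^attempt) seconds, capped at 30s.
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
    producer.send("notifications.dlq", json.dumps(message).encode())
    return False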

7. Bottlenecks & Mitigations

Bottleneck | Cause | Mitigation
Kafka partition hotspot | One mega-campaign key | Partition by hash(userId), not campaign id; shard campaigns
Provider rate limits | FCM quotas | Token bucket per project in the dispatcher; shard Firebase projects for mega apps
Template store outage | S3 blip | Cache the last-known template version in Redis with a TTL
Preference check storm | Viral event | Cache preferences for 60 s per user; invalidate on PATCH
Duplicate deliveries | At-least-once Kafka | Idempotent provider APIs where possible; dedupe_key in the DB
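A minimal in-process token-bucket sketch for the per-project rate-limit mitigation above; rates and project ids are illustrative, and a real dispatcher fleet would back this with a shared store such as Redis.

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill continuously based on elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False                  # caller backs off or requeues the message

buckets = {"fcm-project-a": TokenBucket(rate_per_sec=500, burst=1000)}  # per-project limits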

8. Tradeoffs

Decision | Alternative | Why we picked it
Kafka pipeline | SQS + Lambda | Replay and multi-consumer analytics
Postgres preferences | DynamoDB | Complex queries and joins with user accounts
Push via FCM/APNs directly | OneSignal-style abstraction | Control over cost and data residency; more integration work
Sync render in worker | Pre-render at enqueue | Worker pulls fresh prefs and tokens at send time
202 Accepted enqueue | Synchronous send | User-facing APIs must not block on providers
Dead-letter Kafka topic | S3 only | Operational replay with ordering context

9. Follow-ups (interviewer drill-downs)

  • Order-sensitive notifications (OTP before marketing)? Separate topics with priority per SLA class.
  • Exactly-once to user device? Impossible end-to-end; effectively-once UX via dedupe window and idempotent deep links.
  • Multi-region: active-active ingestion with global Kafka (MirrorMaker 2) vs regional isolation with user affinity.
  • A/B test copy: feature flag in render worker; metrics to warehouse.
  • Rate limit abuse of /v1/notifications? Mesh-level distributed rate limiter per calling service.
  • WhatsApp Business channel? Meta Cloud API add-on router with separate compliance storage.
