Design a Notification System
1. Requirements
Functional
- Deliver user-targeted notifications across channels: mobile push (FCM/APNs), email (SMTP provider), SMS (Twilio-class), and optional in-app inbox.
- Support transactional triggers (password reset) and product events (new follower); optional scheduled campaigns with segmentation.
- Per-user channel preferences, quiet hours, frequency caps, and opt-in/out with legal compliance (unsubscribe).
- At-least-once delivery to external providers; dedupe visible duplicates within a time window.
- Track delivery state: queued, sent, delivered, opened (where channel supports), failed with reason.
Non-Functional
- Scale: 50M users; peak 5k notification fan-out requests/s into the system; providers absorb millions of device deliveries with their own scaling.
- Latency: enqueue p99 under 100 ms; push reaches the device within seconds for the real-time class; batch campaigns can be deferred by minutes.
- Availability: 99.95% for ingestion API; downstream providers have separate SLAs.
- Consistency: preference updates are strongly consistent per user; delivery logs are eventually consistent for analytics.
- Durability: no silent drop—every accepted notification must be persisted until acknowledged or dead-lettered.
Out of Scope
- Full marketing automation visual builder.
- On-device notification rendering and rich media composition beyond payload templates.
- In-house SMTP infrastructure (use SES/SendGrid).
- End-to-end encryption of notification body for push (payload visible to Google/Apple).
2. Back-of-Envelope Estimations
Assume 50M MAU and an average of 3 notification-worthy events per user per week ~ 150M notifications/week ~ 21M/day.
- Ingest: 21M / 86,400 ~ 250/s average; campaign spikes 500/s; major incidents 5k/s enqueue.
- Fan-out: one event may target 1 user or 1M (broadcast). Segmented campaigns use precomputed audience batches—avoid naive O(N) joins at send time.
- Storage: 21M rows/day * 2 KB metadata ~ 43 GB/day; 90-day retention ~ 3.9 TB in cold store + hot index.
- Push provider egress: mostly outbound HTTPS to FCM/APNs; bandwidth dominated by JSON payloads—negligible vs image-heavy systems.
- Queue depth: 5k/s sustained for 60 s = 300k messages, ~600 MB at 2 KB each—Kafka partition planning needs headroom.
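Quick sanity check of the arithmetic, in TypeScript; every constant is one of this section's assumptions, not a measurement.

```ts
// Back-of-envelope figures from this section, reproduced as code.
const mau = 50e6;
const eventsPerUserPerWeek = 3;
const perDay = (mau * eventsPerUserPerWeek) / 7;        // ≈ 21.4M notifications/day
const avgRps = perDay / 86_400;                         // ≈ 250/s average ingest
const rowBytes = 2_000;                                 // ~2 KB metadata per row
const storagePerDayGB = (perDay * rowBytes) / 1e9;      // ≈ 43 GB/day
const retention90dTB = (storagePerDayGB * 90) / 1_000;  // ≈ 3.9 TB over 90 days
const queueDepthMsgs = 5_000 * 60;                      // 300k messages at 5k/s for 60 s
const queueDepthMB = (queueDepthMsgs * rowBytes) / 1e6; // ≈ 600 MB of queue backlog
console.log({ perDay, avgRps, storagePerDayGB, retention90dTB, queueDepthMsgs, queueDepthMB });
```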
3. API Design
POST /v1/notifications
Authorization: Bearer <service-token>
Body: {
  "templateId": "password_reset",
  "userId": "u_123",
  "channels": ["push", "email"],
  "variables": { "resetLink": "https://..." },
  "priority": "high",
  "dedupeKey": "pwd-reset-u_123-20260429"
}
-> 202 { "notificationId": "n_abc", "status": "queued" }

GET /v1/users/{userId}/notification-preferences
-> 200 { "push": true, "email": true, "sms": false, "quietHours": { "start": "22:00", "end": "07:00", "tz": "America/Los_Angeles" } }

PATCH /v1/users/{userId}/notification-preferences
Body: { "push": false }
-> 204
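A minimal sketch of how a send-time worker could honor the quietHours object above, assuming IANA timezone names and zero-padded 24-hour HH:MM strings; the window may cross midnight.

```ts
// Hypothetical helper: is "now" inside the user's quiet hours?
function inQuietHours(
  quiet: { start: string; end: string; tz: string },
  now: Date = new Date(),
): boolean {
  // Wall-clock "HH:MM" in the user's timezone.
  const local = new Intl.DateTimeFormat("en-GB", {
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23",
    timeZone: quiet.tz,
  }).format(now);
  // "09:00"–"17:00" is a same-day window; "22:00"–"07:00" crosses midnight.
  return quiet.start <= quiet.end
    ? local >= quiet.start && local < quiet.end
    : local >= quiet.start || local < quiet.end;
}
```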
Worker internal
gRPC RenderAndDispatch(RenderRequest) -> includes resolved template, channel-specific payload bytes.

Webhooks from providers
POST /v1/webhooks/fcm
Body: { "messageId": "...", "event": "delivered" }

4. Data Model
Entities
- notification: id, user_id, template_id, channels[], variables JSON, status, created_at, dedupe_key (unique partial index).
- delivery_attempt: notification_id, channel, provider_message_id, state, error_code, timestamp.
- user_preferences: user_id PK, per-channel flags, quiet hours, locale.
- device_token: user_id, platform, token, last_seen, invalidated_at — for FCM/APNs mapping.
Database choice
- PostgreSQL for relational integrity between users and preferences; use JSONB for flexible quiet-hour rules.
- High-write logs: partition delivery_attempt by month, or sink it from Kafka into Cassandra/BigQuery when analytics volume outgrows Postgres.
Templates
- Store versioned JSON + Handlebars in a Git-backed CMS or S3; the render service pulls by templateId and version. Reserve DB-stored blobs for non-developer edits only.
Indexes
- (user_id, created_at DESC) for inbox APIs.
- Unique (dedupe_key) where not null.
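A minimal pg migration sketch for these two indexes, assuming the notification table and columns from the entity list above.

```ts
import { Pool } from "pg";

const pool = new Pool(); // reads PG* env vars

async function migrate(): Promise<void> {
  // Inbox reads: newest-first per user.
  await pool.query(
    `CREATE INDEX IF NOT EXISTS notification_inbox_idx
       ON notification (user_id, created_at DESC)`,
  );
  // Dedupe: unique only where a key was supplied (partial unique index).
  await pool.query(
    `CREATE UNIQUE INDEX IF NOT EXISTS notification_dedupe_uq
       ON notification (dedupe_key)
       WHERE dedupe_key IS NOT NULL`,
  );
  await pool.end();
}

migrate().catch((err) => { console.error(err); process.exit(1); });
```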
Why Kafka vs RabbitMQ for queue
- Kafka: replay, retention, multiple consumer groups (render vs analytics). RabbitMQ: classic task queues with lower operational load at moderate volume. Pick Kafka when replay and throughput dominate.
5. High-Level Architecture
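Producer services call the ingestion API, which checks preferences and dedupe, persists the notification row, and publishes a compact event to Kafka. Render workers consume, resolve template and locale, and hand channel-specific payloads to the channel routers (FCM/APNs, SES/SendGrid, Twilio-class SMS). Provider webhooks report delivery state into the delivery log, which streams to the analytics sink; the in-app inbox is served from its own read path.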
6. Component Deep-Dives
Ingestion service
- Validates auth (service-to-service mTLS or OAuth2 client credentials).
- Loads preferences; if all channels off, short-circuit with audit log.
- Enforces dedupe: INSERT ... ON CONFLICT DO NOTHING on dedupe_key; on conflict, skip dispatch.
- Publishes a compact event { notificationId, userId, templateId, version } to Kafka—not the full variable payload when it is PII-heavy (the worker loads variables from a secure store).
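A sketch of the dedupe-then-publish step using node-postgres and kafkajs; topic, broker, and column names are assumptions, and the conflict target carries the partial-index predicate from the data model above.

```ts
import { Pool } from "pg";
import { Kafka } from "kafkajs";

const pool = new Pool(); // PG* env vars
const producer = new Kafka({ clientId: "ingestion", brokers: ["kafka:9092"] }).producer();
const producerReady = producer.connect(); // connect once at startup

// Returns false when the dedupe window already saw this key.
async function enqueueNotification(n: {
  id: string; userId: string; templateId: string; dedupeKey?: string;
}): Promise<boolean> {
  const res = await pool.query(
    `INSERT INTO notification (id, user_id, template_id, dedupe_key, status)
       VALUES ($1, $2, $3, $4, 'queued')
       ON CONFLICT (dedupe_key) WHERE dedupe_key IS NOT NULL DO NOTHING`,
    [n.id, n.userId, n.templateId, n.dedupeKey ?? null],
  );
  if (res.rowCount === 0) return false; // duplicate: skip dispatch

  await producerReady;
  await producer.send({
    topic: "notifications",
    messages: [{
      key: n.userId, // partition by user, not campaign (see Bottlenecks)
      value: JSON.stringify({ // compact, PII-free event
        notificationId: n.id,
        userId: n.userId,
        templateId: n.templateId,
        version: 1,
      }),
    }],
  });
  return true;
}
```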
Template rendering
- Handlebars (Node/Java) vs Jinja (Python)—pick one stack; Handlebars ports are common in polyglot orgs.
- i18n: resolve locale from user profile; load string packs from Phrase/Lokalise or static bundles in S3.
- Push payload size limits: APNs ~4 KB; trim and link out to web.
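A render-plus-trim sketch with the handlebars package; the 200-character cut and the url link-out field are illustrative choices, not fixed behavior.

```ts
import Handlebars from "handlebars";

const MAX_APNS_BYTES = 4096; // APNs payload ceiling noted above

// templateSrc comes from the template store; vars from the enqueue request.
function renderPushPayload(templateSrc: string, vars: Record<string, string>) {
  const alert = Handlebars.compile(templateSrc)(vars);
  const payload: Record<string, unknown> = { aps: { alert } };
  if (Buffer.byteLength(JSON.stringify(payload), "utf8") <= MAX_APNS_BYTES) {
    return payload;
  }
  // Too large for APNs: trim the text and link out to the full content.
  return { aps: { alert: alert.slice(0, 200) + "…" }, url: vars.link };
}
```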
Channel routers
- FCM HTTP v1 vs legacy—use v1 with service accounts and per-platform credential rotation.
- APNs HTTP/2 with JWT auth; maintain connection pools per worker to avoid handshake storms.
- SES vs SendGrid: SES cheaper at scale; SendGrid better deliverability UX for small teams—product decision.
Device token hygiene
- Prune on 410/404 from providers; periodic sweep stale tokens to avoid wasted provider calls.
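A sketch of the prune rule—FCM v1 signals a dead token with 404 (UNREGISTERED), APNs with 410 (Unregistered); table and column names follow the data model above.

```ts
import { Pool } from "pg";

const pool = new Pool();

// Mark the token dead so future sends skip it; a periodic sweep can delete rows.
async function pruneIfDead(token: string, httpStatus: number): Promise<void> {
  if (httpStatus === 404 || httpStatus === 410) {
    await pool.query(
      `UPDATE device_token SET invalidated_at = now()
        WHERE token = $1 AND invalidated_at IS NULL`,
      [token],
    );
  }
}
```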
In-app inbox
- Separate read path from push: store rendered summary rows in Postgres or Dynamo for mobile sync; cache hot users' inbox lists in Redis.
Failure handling
- Retry with exponential backoff per provider error class; circuit breaker per provider region.
- Dead-letter queue topic in Kafka for manual replay after fix.
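One shape for the retry loop; send, classify, and deadLetter are injected callbacks, and the five-attempt cap is illustrative.

```ts
type ErrorClass = "retryable" | "fatal"; // e.g. 429/5xx vs 400/invalid-token

async function dispatchWithRetry(
  send: () => Promise<void>,
  classify: (err: unknown) => ErrorClass,
  deadLetter: (err: unknown) => Promise<void>, // e.g. publish to the DLQ topic
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (classify(err) === "fatal" || attempt === maxAttempts) {
        return deadLetter(err); // replayed manually after a fix
      }
      // Exponential backoff with jitter to avoid synchronized retries.
      const delayMs = 100 * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```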
7. Bottlenecks & Mitigations
| Bottleneck | Cause | Mitigation |
|---|---|---|
| Kafka partition hotspot | One mega-campaign key | Partition by hash(userId) not campaign id; shard campaigns |
| Provider rate limits | FCM quotas | Token bucket per project in dispatcher; shard Firebase projects for mega apps |
| Template store outage | S3 blip | Cache last-known template version in Redis with TTL |
| Preference check storm | Viral event | Cache preferences 60s per user; invalidate on PATCH |
| Duplicate deliveries | At-least-once Kafka | Idempotent provider APIs where possible; dedupe_key in DB |
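The token-bucket row, sketched in-process; a real dispatcher would share state per provider project across workers, but the refill math is the same.

```ts
// One bucket per provider project; refill is continuous rather than on a timer.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private readonly ratePerSec: number, private readonly burst: number) {
    this.tokens = burst;
  }

  // Take one token if available; callers that get `false` should requeue or wait.
  tryTake(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1_000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// e.g. const fcmBucket = new TokenBucket(600, 1_200); // hypothetical per-project quota
```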
8. Tradeoffs
| Decision | Alternative | Rationale |
|---|---|---|
| Kafka pipeline | SQS + Lambda | Replay and multi-consumer analytics |
| Postgres preferences | DynamoDB | Complex queries and joins with user account |
| Push via FCM/APNs direct | OneSignal abstraction | Control cost and data residency; more integration work |
| Sync render in worker | Pre-render at enqueue | Worker pulls fresh prefs and tokens at send time |
| 202 Accepted enqueue | Synchronous send | User-facing APIs must not block on providers |
| Dead-letter Kafka topic | S3 only | Operational replay with ordering context |
9. Follow-ups (interviewer drill-downs)
- Order-sensitive notifications (OTP before marketing)? Separate topics with priority per SLA class.
- Exactly-once to user device? Impossible end-to-end; effectively-once UX via dedupe window and idempotent deep links.
- Multi-region: active-active ingestion with global Kafka (MirrorMaker 2) vs regional isolation with user affinity.
- A/B test copy: feature flag in render worker; metrics to warehouse.
- Rate limit abuse of /v1/notifications? Mesh-level distributed rate limiter per calling service.
- WhatsApp Business channel? Meta Cloud API add-on router with separate compliance storage.