Design a Notification System
1. Requirements
Functional
- Deliver user-targeted notifications across channels: mobile push (FCM/APNs), email (SMTP provider), SMS (Twilio-class), and optional in-app inbox.
- Support transactional triggers (password reset) and product events (new follower); optional scheduled campaigns with segmentation.
- Per-user channel preferences, quiet hours, frequency caps, and opt-in/out with legal compliance (unsubscribe).
- At-least-once delivery to external providers; dedupe visible duplicates within a time window.
- Track delivery state: queued, sent, delivered, opened (where channel supports), failed with reason.
Non-Functional
- Scale: 50M users; peak 5k notification fan-out requests/s into the system; providers absorb millions of device deliveries with their own scaling.
- Latency: enqueue p99 under 100 ms; push reaches the device within seconds for the real-time class; batch campaigns can be deferred by minutes.
- Availability: 99.95% for ingestion API; downstream providers have separate SLAs.
- Consistency: preference updates are strongly consistent per user; delivery logs are eventually consistent for analytics.
- Durability: no silent drop—every accepted notification must be persisted until acknowledged or dead-lettered.
Out of Scope
- Full marketing automation visual builder.
- On-device notification rendering and rich media composition beyond payload templates.
- In-house SMTP infrastructure (use SES/SendGrid).
- End-to-end encryption of notification body for push (payload visible to Google/Apple).
2. Back-of-Envelope Estimations
Assume 50M MAU and an average of 3 notification-worthy events per user per week ~ 150M notifications/week ~ 21M/day.
- Ingest: 21M / 86,400 ~ 250/s average; campaign spikes 500/s; major incidents 5k/s enqueue.
- Fan-out: one event may target 1 user or 1M (broadcast). Segmented campaigns use precomputed audience batches—avoid naive O(N) joins at send time.
- Storage: 21M rows/day * 2 KB metadata ~ 43 GB/day; 90-day retention ~ 3.9 TB in cold store + hot index.
- Push provider egress: mostly outbound HTTPS to FCM/APNs; bandwidth dominated by JSON payloads—negligible vs image-heavy systems.
- Queue depth: 5k/s sustained for 60 s = 300k messages, ~600 MB at 2 KB each—Kafka partition planning needs headroom.
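Quick sanity check of the arithmetic, in TypeScript; every constant is one of this section's assumptions, not a measurement.

```ts
// Back-of-envelope figures from this section, reproduced as code.
const mau = 50e6;
const eventsPerUserPerWeek = 3;
const perDay = (mau * eventsPerUserPerWeek) / 7;        // ≈ 21.4M notifications/day
const avgRps = perDay / 86_400;                         // ≈ 250/s average ingest
const rowBytes = 2_000;                                 // ~2 KB metadata per row
const storagePerDayGB = (perDay * rowBytes) / 1e9;      // ≈ 43 GB/day
const retention90dTB = (storagePerDayGB * 90) / 1_000;  // ≈ 3.9 TB over 90 days
const queueDepthMsgs = 5_000 * 60;                      // 300k messages at 5k/s for 60 s
const queueDepthMB = (queueDepthMsgs * rowBytes) / 1e6; // ≈ 600 MB of queue backlog
console.log({ perDay, avgRps, storagePerDayGB, retention90dTB, queueDepthMsgs, queueDepthMB });
```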
3. API Design
POST /v1/notifications
Authorization: Bearer <service-token>
Body: {
  "templateId": "password_reset",
  "userId": "u_123",
  "channels": ["push", "email"],
  "variables": { "resetLink": "https://..." },
  "priority": "high",
  "dedupeKey": "pwd-reset-u_123-20260429"
}
-> 202 { "notificationId": "n_abc", "status": "queued" }

GET /v1/users/{userId}/notification-preferences
-> 200 { "push": true, "email": true, "sms": false, "quietHours": { "start": "22:00", "end": "07:00", "tz": "America/Los_Angeles" } }

PATCH /v1/users/{userId}/notification-preferences
Body: { "push": false }
-> 204
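A minimal sketch of how a send-time worker could honor the quietHours object above, assuming IANA timezone names and zero-padded 24-hour HH:MM strings; the window may cross midnight.

```ts
// Hypothetical helper: is "now" inside the user's quiet hours?
function inQuietHours(
  quiet: { start: string; end: string; tz: string },
  now: Date = new Date(),
): boolean {
  // Wall-clock "HH:MM" in the user's timezone.
  const local = new Intl.DateTimeFormat("en-GB", {
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23",
    timeZone: quiet.tz,
  }).format(now);
  // "09:00"–"17:00" is a same-day window; "22:00"–"07:00" crosses midnight.
  return quiet.start <= quiet.end
    ? local >= quiet.start && local < quiet.end
    : local >= quiet.start || local < quiet.end;
}
```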
Worker internal
gRPC RenderAndDispatch(RenderRequest) -> includes resolved template, channel-specific payload bytes.

Webhooks from providers
POST /v1/webhooks/fcm
Body: { "messageId": "...", "event": "delivered" }

4. Data Model
Entities
- notification: id, user_id, template_id, channels[], variables JSON, status, created_at, dedupe_key (unique partial index).
- delivery_attempt: notification_id, channel, provider_message_id, state, error_code, timestamp.
- user_preferences: user_id PK, per-channel flags, quiet hours, locale.
- device_token: user_id, platform, token, last_seen, invalidated_at — for FCM/APNs mapping.
Database choice
- PostgreSQL for relational integrity between users and preferences; use JSONB for flexible quiet-hour rules.
- High-write logs: partition delivery_attempt by month, or sink it from Kafka into Cassandra/BigQuery when analytics volume outgrows Postgres.
Templates
- Store versioned JSON + Handlebars in a Git-backed CMS or S3; the render service pulls by templateId and version. Reserve DB-stored blobs for non-developer edits only.
Indexes
- (user_id, created_at DESC) for inbox APIs.
- Unique (dedupe_key) where not null.
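A minimal pg migration sketch for these two indexes, assuming the notification table and columns from the entity list above.

```ts
import { Pool } from "pg";

const pool = new Pool(); // reads PG* env vars

async function migrate(): Promise<void> {
  // Inbox reads: newest-first per user.
  await pool.query(
    `CREATE INDEX IF NOT EXISTS notification_inbox_idx
       ON notification (user_id, created_at DESC)`,
  );
  // Dedupe: unique only where a key was supplied (partial unique index).
  await pool.query(
    `CREATE UNIQUE INDEX IF NOT EXISTS notification_dedupe_uq
       ON notification (dedupe_key)
       WHERE dedupe_key IS NOT NULL`,
  );
  await pool.end();
}

migrate().catch((err) => { console.error(err); process.exit(1); });
```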
Why Kafka vs RabbitMQ for queue
- Kafka: replay, retention, multiple consumer groups (render vs analytics). RabbitMQ: classic task queues with lower operational load at moderate volume. Pick Kafka when replay and throughput dominate.
5. High-Level Architecture
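Producer services call the ingestion API, which checks preferences and dedupe, persists the notification row, and publishes a compact event to Kafka. Render workers consume, resolve template and locale, and hand channel-specific payloads to the channel routers (FCM/APNs, SES/SendGrid, Twilio-class SMS). Provider webhooks report delivery state into the delivery log, which streams to the analytics sink; the in-app inbox is served from its own read path.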
6. Component Deep-Dives
Ingestion service
- Validates auth (service-to-service mTLS or OAuth2 client credentials).
- Loads preferences; if all channels off, short-circuit with audit log.
- Enforces dedupe: INSERT ... ON CONFLICT DO NOTHING on dedupe_key; on conflict, skip dispatch.
- Publishes a compact event { notificationId, userId, templateId, version } to Kafka—not the full variable payload when it is PII-heavy (the worker loads variables from a secure store).
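A sketch of the dedupe-then-publish step using node-postgres and kafkajs; topic, broker, and column names are assumptions, and the conflict target carries the partial-index predicate from the data model above.

```ts
import { Pool } from "pg";
import { Kafka } from "kafkajs";

const pool = new Pool(); // PG* env vars
const producer = new Kafka({ clientId: "ingestion", brokers: ["kafka:9092"] }).producer();
const producerReady = producer.connect(); // connect once at startup

// Returns false when the dedupe window already saw this key.
async function enqueueNotification(n: {
  id: string; userId: string; templateId: string; dedupeKey?: string;
}): Promise<boolean> {
  const res = await pool.query(
    `INSERT INTO notification (id, user_id, template_id, dedupe_key, status)
       VALUES ($1, $2, $3, $4, 'queued')
       ON CONFLICT (dedupe_key) WHERE dedupe_key IS NOT NULL DO NOTHING`,
    [n.id, n.userId, n.templateId, n.dedupeKey ?? null],
  );
  if (res.rowCount === 0) return false; // duplicate: skip dispatch

  await producerReady;
  await producer.send({
    topic: "notifications",
    messages: [{
      key: n.userId, // partition by user, not campaign (see Bottlenecks)
      value: JSON.stringify({ // compact, PII-free event
        notificationId: n.id,
        userId: n.userId,
        templateId: n.templateId,
        version: 1,
      }),
    }],
  });
  return true;
}
```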
Template rendering
- Handlebars (Node/Java) vs Jinja (Python)—pick one stack; Handlebars ports are common in polyglot orgs.
- i18n: resolve locale from user profile; load string packs from Phrase/Lokalise or static bundles in S3.
- Push payload size limits: APNs ~4 KB; trim and link out to web.
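A render-plus-trim sketch with the handlebars package; the 200-character cut and the url link-out field are illustrative choices, not fixed behavior.

```ts
import Handlebars from "handlebars";

const MAX_APNS_BYTES = 4096; // APNs payload ceiling noted above

// templateSrc comes from the template store; vars from the enqueue request.
function renderPushPayload(templateSrc: string, vars: Record<string, string>) {
  const alert = Handlebars.compile(templateSrc)(vars);
  const payload: Record<string, unknown> = { aps: { alert } };
  if (Buffer.byteLength(JSON.stringify(payload), "utf8") <= MAX_APNS_BYTES) {
    return payload;
  }
  // Too large for APNs: trim the text and link out to the full content.
  return { aps: { alert: alert.slice(0, 200) + "…" }, url: vars.link };
}
```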
Channel routers
- FCM HTTP v1 vs legacy—use v1 with service accounts and per-platform credential rotation.
- APNs HTTP/2 with JWT auth; maintain connection pools per worker to avoid handshake storms.
- SES vs SendGrid: SES cheaper at scale; SendGrid better deliverability UX for small teams—product decision.
Device token hygiene
- Prune on 410/404 from providers; periodic sweep stale tokens to avoid wasted provider calls.
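A sketch of the prune rule—FCM v1 signals a dead token with 404 (UNREGISTERED), APNs with 410 (Unregistered); table and column names follow the data model above.

```ts
import { Pool } from "pg";

const pool = new Pool();

// Mark the token dead so future sends skip it; a periodic sweep can delete rows.
async function pruneIfDead(token: string, httpStatus: number): Promise<void> {
  if (httpStatus === 404 || httpStatus === 410) {
    await pool.query(
      `UPDATE device_token SET invalidated_at = now()
        WHERE token = $1 AND invalidated_at IS NULL`,
      [token],
    );
  }
}
```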
In-app inbox
- Separate read path from push: store rendered summary rows in Postgres or Dynamo for mobile sync; cache hot users' inbox lists in Redis.
Failure handling
- Retry with exponential backoff per provider error class; circuit breaker per provider region.
- Dead-letter queue topic in Kafka for manual replay after fix.
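One shape for the retry loop; send, classify, and deadLetter are injected callbacks, and the five-attempt cap is illustrative.

```ts
type ErrorClass = "retryable" | "fatal"; // e.g. 429/5xx vs 400/invalid-token

async function dispatchWithRetry(
  send: () => Promise<void>,
  classify: (err: unknown) => ErrorClass,
  deadLetter: (err: unknown) => Promise<void>, // e.g. publish to the DLQ topic
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      if (classify(err) === "fatal" || attempt === maxAttempts) {
        return deadLetter(err); // replayed manually after a fix
      }
      // Exponential backoff with jitter to avoid synchronized retries.
      const delayMs = 100 * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```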
7. Bottlenecks & Mitigations
| Bottleneck | Cause | Mitigation |
|---|---|---|
| Kafka partition hotspot | One mega-campaign key | Partition by hash(userId) not campaign id; shard campaigns |
| Provider rate limits | FCM quotas | Token bucket per project in dispatcher; shard Firebase projects for mega apps |
| Template store outage | S3 blip | Cache last-known template version in Redis with TTL |
| Preference check storm | Viral event | Cache preferences 60s per user; invalidate on PATCH |
| Duplicate deliveries | At-least-once Kafka | Idempotent provider APIs where possible; dedupe_key in DB |
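The token-bucket row, sketched in-process; a real dispatcher would share state per provider project across workers, but the refill math is the same.

```ts
// One bucket per provider project; refill is continuous rather than on a timer.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private readonly ratePerSec: number, private readonly burst: number) {
    this.tokens = burst;
  }

  // Take one token if available; callers that get `false` should requeue or wait.
  tryTake(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1_000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// e.g. const fcmBucket = new TokenBucket(600, 1_200); // hypothetical per-project quota
```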
8. Tradeoffs
| Decision | Alternative | Rationale |
|---|---|---|
| Kafka pipeline | SQS + Lambda | Replay and multi-consumer analytics |
| Postgres preferences | DynamoDB | Complex queries and joins with user account |
| Push via FCM/APNs direct | OneSignal abstraction | Control cost and data residency; more integration work |
| Sync render in worker | Pre-render at enqueue | Worker pulls fresh prefs and tokens at send time |
| 202 Accepted enqueue | Synchronous send | User-facing APIs must not block on providers |
| Dead-letter Kafka topic | S3 only | Operational replay with ordering context |
9. Follow-ups (interviewer drill-downs)
- Order-sensitive notifications (OTP before marketing)? Separate topics with priority per SLA class.
- Exactly-once to user device? Impossible end-to-end; effectively-once UX via dedupe window and idempotent deep links.
- Multi-region: active-active ingestion with global Kafka (MirrorMaker 2) vs regional isolation with user affinity.
- A/B test copy: feature flag in render worker; metrics to warehouse.
- Rate limit abuse of /v1/notifications? Mesh-level distributed rate limiter per calling service.
- WhatsApp Business channel? Meta Cloud API add-on router with separate compliance storage.