Notice Period
system design · week 5

Design a Notification System

The canonical queue question — decoupling, retries, and idempotency in one design.

You're asked: "Design a system that sends notifications — push, SMS, and email — to millions of users." Order shipped, friend requested, password reset: every product needs this, which is why every interview loop has it.

What makes this question great: the final architecture is the textbook queue-centric design, and the road there teaches the two morals you'll reuse in half of all system design answers — queues decouple producers from slow consumers, and at-least-once delivery plus idempotency beats chasing exactly-once.

Hidden trap to spot early: you don't actually deliver anything. Apple, Google, Twilio, and your email provider do. Your system's real job is to reliably hand work to third parties you don't control — slow, flaky, rate-limited third parties. Design for that and everything falls into place.

01 Requirements: channels, triggers, guarantees

[Diagram: Product Services (order, auth, social…) call notify(user, event) on the Notification Service (internals ???), which hands off to APNs / FCM (push), Twilio (SMS), and SES (email).]

Scope it out loud before drawing anything.

Functional

  • Channels: push (iOS/Android), SMS, email — and a user can receive one event on several channels.
  • Triggers: other services fire events ("order #123 shipped"); also bulk/scheduled sends (marketing campaign to 10M users at 9am).
  • Preferences & opt-outs: users mute channels or categories. For SMS/email this is law (TCPA, CAN-SPAM, GDPR) — say the word "compliance" and the interviewer will visibly relax.

Non-functional

  • No lost notifications: a missing "password reset" email is a support ticket; a missing "payment failed" is real damage. Target at-least-once delivery.
  • At-least-once implies possible duplicates — we'll need to handle them deliberately (stage 4).
  • Soft real-time: seconds of delay is fine; minutes for marketing. Nobody needs single-digit-millisecond notifications — which buys us the freedom to go async.
  • Scale: say 10M notifications/day baseline, which averages about 115/sec (10M ÷ 86,400 seconds), but bursts are the real sizing input: a campaign can dump millions into the system in one minute. Spiky write load is the defining shape of this problem.
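
The scale arithmetic above can be checked in a few lines. This is a back-of-envelope sketch: the 10M/day baseline comes from the text, while the burst size of 3M in 60 seconds is a hypothetical stand-in for "millions in one minute".

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = 10_000_000                       # baseline from the text
average_rate = daily_volume / SECONDS_PER_DAY   # about 115.7 per second

# Hypothetical campaign burst (assumption): 3M notifications enqueued in 60 seconds.
burst_size = 3_000_000
burst_window_seconds = 60
burst_rate = burst_size / burst_window_seconds  # 50,000 per second

# How much larger the burst is than the daily average.
burst_to_average = burst_rate / average_rate

print(f"average: {average_rate:.1f}/sec")
print(f"burst:   {burst_rate:,.0f}/sec ({burst_to_average:.0f}x average)")
```

Under these assumed numbers the burst is several hundred times the average, which is why the average alone is a misleading sizing input.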

Say this out loud: "I'll guarantee at-least-once delivery and handle duplicates explicitly — exactly-once to a third-party provider is not achievable, so I won't pretend otherwise." That single sentence front-loads the most senior insight in this question.

The diagram starts honestly: services that need to notify, a notification service in the middle (contents TBD), and the third-party providers on the far side — the parts we don't control, drawn as external boxes from minute one.

concept · Delivery guarantees

At-most-once: fire and forget, may lose messages. At-least-once: retry until acknowledged, may duplicate. Exactly-once: a myth across system boundaries you don't control — the honest version is at-least-once plus idempotent processing. Quote that last clause verbatim in interviews.
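
To make "at-least-once plus idempotent processing" concrete, here is a minimal in-memory sketch. All names (`IdempotentSender`, `deliver_at_least_once`, `notification_id`, `flaky_send`) are hypothetical, and the in-process set stands in for what would presumably be a shared, durable store in a real system.

```python
class IdempotentSender:
    """Skips messages whose notification_id was already handled."""

    def __init__(self, provider_send):
        self._provider_send = provider_send
        self._done = set()  # assumption: a durable shared store in production

    def handle(self, message):
        key = message["notification_id"]
        if key in self._done:
            return False  # duplicate delivery, nothing sent
        # If this raises, the key is not recorded, so a retry will send again.
        self._provider_send(message)
        # A crash between the send and this line could still cause one duplicate.
        self._done.add(key)
        return True


def deliver_at_least_once(message, handle, max_attempts=5):
    """Retry until the handler returns without raising (the acknowledgement)."""
    for attempt in range(1, max_attempts + 1):
        try:
            handle(message)
            return attempt
        except ConnectionError:
            continue
    raise RuntimeError("gave up after max_attempts")


# Demo: a provider stub that fails twice, then accepts the message.
sent = []
failures_left = [2]

def flaky_send(message):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise ConnectionError("provider unavailable")
    sent.append(message["notification_id"])

sender = IdempotentSender(flaky_send)
msg = {"notification_id": "order-123-shipped:user-42:email", "body": "Your order shipped"}

attempts = deliver_at_least_once(msg, sender.handle)  # retries through the failures
redelivered = sender.handle(msg)                      # simulated duplicate from the queue
print(attempts, sent, redelivered)
```

The retry loop supplies the at-least-once half; the key check supplies the idempotent half. The comment inside `handle` marks the window where a duplicate can still slip through, which is the reason the guarantee is stated as at-least-once rather than exactly-once.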

concept · Opt-out as a hard requirement

SMS and email opt-outs are legally mandated (TCPA, CAN-SPAM, GDPR), not a nice-to-have. Preference checks must sit in the send path where they cannot be bypassed by any producer.
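
One way to keep the preference check from being bypassed is to make it part of the only entry point producers can call. The sketch below assumes hypothetical names (`Preferences`, `NotificationService`, `make_sender`) and an allow-by-default policy for users with no stored preferences; real opt-in rules vary by channel and jurisdiction.

```python
from dataclasses import dataclass, field


@dataclass
class Preferences:
    muted_channels: set = field(default_factory=set)
    muted_categories: set = field(default_factory=set)

    def allows(self, channel, category):
        return (channel not in self.muted_channels
                and category not in self.muted_categories)


class NotificationService:
    def __init__(self, preferences_by_user, senders):
        self._prefs = preferences_by_user  # user_id -> Preferences
        self._senders = senders            # channel -> callable(user_id, body)

    def notify(self, user_id, category, body, channels):
        """The only send path exposed to producers; every send is checked here."""
        # Assumption: users without stored preferences allow everything.
        prefs = self._prefs.get(user_id, Preferences())
        delivered = []
        for channel in channels:
            if not prefs.allows(channel, category):
                continue  # opted out: drop this channel silently
            self._senders[channel](user_id, body)
            delivered.append(channel)
        return delivered


# Demo with in-memory senders standing in for the real providers.
outbox = []

def make_sender(channel):
    return lambda user_id, body: outbox.append((channel, user_id, body))

service = NotificationService(
    preferences_by_user={"user-42": Preferences(muted_channels={"sms"})},
    senders={"email": make_sender("email"), "sms": make_sender("sms")},
)
result = service.notify("user-42", "order", "Your order shipped", ["email", "sms"])
print(result, outbox)
```

Because producers only ever hold a reference to `notify`, there is no code path that reaches a sender without passing through `allows` first.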

Checkpoint

1. Why do we target at-least-once rather than exactly-once delivery?