Design a Chat System
WhatsApp-style messaging — the case that teaches you stateful servers.
You're asked: "Design a chat system like WhatsApp or Messenger." One-on-one and group conversations, messages delivered in real time, with the little checkmarks users obsess over.
Every system you've designed so far was request/response: the client asks, the server answers, the connection dies. Chat breaks that model: the server must push data to the client the moment a message arrives. That single requirement calls for a long-lived connection (WebSockets in practice), which forces stateful servers, which changes how you do load balancing, scaling, and failover.
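To make "stateful" concrete, here is a minimal sketch of what a chat server has to remember that a request/response server does not: which open connections belong to which user. The `ConnectionRegistry` class and the `Send` callable are hypothetical stand-ins (a real server would hold WebSocket objects), and everything lives in one process's memory, which is exactly why load balancing and failover get harder.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an open WebSocket: any callable that pushes text to one device.
Send = Callable[[str], None]


class ConnectionRegistry:
    """In-memory map of user id -> open connections held by THIS server process."""

    def __init__(self) -> None:
        self._conns: Dict[str, List[Send]] = {}

    def connect(self, user_id: str, send: Send) -> None:
        # A user may have several devices, so keep a list per user.
        self._conns.setdefault(user_id, []).append(send)

    def disconnect(self, user_id: str, send: Send) -> None:
        conns = self._conns.get(user_id, [])
        if send in conns:
            conns.remove(send)
        if not conns:
            self._conns.pop(user_id, None)

    def push(self, user_id: str, message: str) -> int:
        """Push to every open connection for the user; return how many were reached."""
        conns = self._conns.get(user_id, [])
        for send in conns:
            send(message)
        return len(conns)


registry = ConnectionRegistry()
inbox: List[str] = []          # pretend this list is Bob's phone
bob_send = inbox.append
registry.connect("bob", bob_send)

reached = registry.push("bob", "hi from alice")    # Bob is connected here
missed = registry.push("carol", "anyone there?")   # Carol has no connection on this server
print(reached, missed, inbox)
```

If this process restarts, the map is gone and every client must reconnect. A stateless HTTP server has no equivalent of that problem.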
Interview framing: this case is really two interleaved problems — (1) keeping a live pipe open to millions of phones, and (2) reliably storing and routing messages between those pipes. Keep them separate in your head and on the whiteboard, and you'll sound organized while other candidates draw spaghetti.
01 Requirements & estimation
Lock the scope before drawing anything.
Functional
- 1:1 chat and group chat (clarify group size — say ≤ 500 members, not broadcast channels)
- Online presence: show who's online / last seen
- Delivery & read receipts: sent ✓, delivered ✓✓, read (blue)
- Message ordering: messages in a conversation appear in the same order for everyone
- Multi-device sync (phone + laptop see the same history)
Non-functional
- Real-time: delivery latency well under a second when both users are online
- No lost messages — durability matters more than in most systems; a dropped "I'm at the gate" is a product failure
- Messages are small (text), but volume is enormous
Back-of-envelope — assume 50M DAU, 40 messages sent per user per day:
- Throughput: 50M × 40 = 2B messages/day; 2B / 86,400 s ≈ 23K messages/sec on average. Plan for a peak of ~100K/sec (roughly 4× the average).
- Storage: ~200 bytes/message × 2B/day = 400 GB/day ≈ 150 TB/year. That is too big and too write-hot for one box; sharding is coming.
- Concurrent connections: a large share of the 50M daily users keep a mostly idle connection open at any given moment. That number, tens of millions of open sockets, is the defining constraint of this design. Say it out loud.
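The arithmetic above is easy to check with a few lines of Python. The inputs are the ones stated in the estimate; the only added assumption is `PEAK_FACTOR = 4`, picked so the peak lands near the ~100K/sec planning figure.

```python
DAU = 50_000_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MSG = 200
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 4  # assumption: peak traffic is roughly 4x the daily average

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY         # 2 billion
avg_per_sec = msgs_per_day / SECONDS_PER_DAY        # about 23K
peak_per_sec = avg_per_sec * PEAK_FACTOR            # about 93K, rounded up to ~100K
gb_per_day = msgs_per_day * BYTES_PER_MSG / 1e9     # 400 GB
tb_per_year = gb_per_day * 365 / 1000               # about 146 TB

print(f"{msgs_per_day:,} messages/day")
print(f"{avg_per_sec:,.0f} messages/sec average, {peak_per_sec:,.0f} at peak")
print(f"{gb_per_day:,.0f} GB/day, {tb_per_year:,.0f} TB/year")
```

Note that this covers raw message text only; indexes, replication, and media would push the storage figure higher.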
The scoring sentence for this stage: "The hard part isn't message volume — it's holding tens of millions of long-lived connections and knowing which connection belongs to which user."
Sent = server accepted it. Delivered = recipient's device received it. Read = recipient opened the conversation. Each is a distinct event flowing back to the sender — naming all three shows you've thought about the product, not just the boxes.
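One way to model the three receipt states is as an ordered enum whose value only moves forward. This is a sketch, not a prescribed design: the `Receipt` enum and `advance` function are hypothetical names, and the "never move backward" rule is an assumption added here so that late or duplicate events from the network cannot downgrade a status.

```python
from enum import IntEnum


class Receipt(IntEnum):
    SENT = 1        # server accepted the message
    DELIVERED = 2   # recipient's device received it
    READ = 3        # recipient opened the conversation


def advance(current: Receipt, event: Receipt) -> Receipt:
    """Apply a receipt event; status only moves forward, stale events are ignored."""
    return max(current, event)


status = Receipt.SENT
status = advance(status, Receipt.READ)       # read event arrives first
status = advance(status, Receipt.DELIVERED)  # late delivered event is ignored
print(status.name)
```

Treating the status as monotonic means the sender's checkmarks stay correct even when receipt events arrive out of order.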
Checkpoint
What is the single most distinctive constraint of a chat system versus a typical web service?