Let’s design a text messaging app, a system for 100K users to send messages to each other within seconds. Do not consider video or audio chat. Users send messages at an unpredictable rate, so our system should be able to handle these traffic surges. Messages should not be lost, nor should they be sent more than once.

Requirements

We determined the following functional requirements:

  • Real-time or eventually-consistent? Consider either case.
  • How many users may a chatroom have? A chatroom can contain between two to 1,000 users.
  • Is there a character limit for a message? Let’s make it 1000 UTF-8 characters. At up to 32 bits/character, a message is up to 4 KB in size.
  • Notification is a platform-specific detail that we need not consider. Android, iOS, Chrome, and Windows apps each have their platform-specific notification library.
  • Delivery confirmation and read receipt.
  • Log the messages. Users can view and search up to 10 MB of their past messages. With one billion users, this works out to 10 PB of storage.
  • Message body is private. End-to-end encryption will be ideal.
  • Messaging apps allow users to see if their connections are online.
  • We consider sending text only, not media like voice messages, photos, or videos.

Non-functional requirements:

  • Scalability: 100K simultaneous users. Assume each user sends a 4 KB message every minute, which is a write rate of 400 MB/min. A user can have up to 1,000 connections, and a message can be sent to up to 1,000 recipients, each of whom may have up to five devices.
  • High availability: Four nines availability.
  • High performance: 10-second P99 (99% of the requests should be faster than given latency. In other words only 1% of the requests are allowed to be slower) message delivery time.
  • Security and privacy: Require user authentication. Messages should be private.
  • Consistency: Strict ordering of messages is not necessary. If multiple users send messages to each other more or less simultaneously, these messages can appear in different orders to different users.

High-level design

A user first selects the recipient (by name) of their message from a list of recipients. Next, they compose a message on a mobile, desktop, or browser app and then hit a Send button. The app first encrypts the message with the recipient’s public key and then makes a request to our messaging service to deliver the message. Our messaging service sends the message to the recipient. The recipient sends delivery confirmation and read receipt messages to the sender. This design has the following implications:

  • Our app needs to store each recipient’s metadata, including names and public keys.
  • Our messaging service needs to maintain an open WebSocket connection to each recipient.
  • If there is more than one recipient, the sender needs to encrypt the message with each recipient’s public key.
  • Our messaging service needs to handle unpredictable traffic surges from many senders suddenly deciding to send messages within a short period.

Referring to below figure, we create separate services to serve different functional requirements and optimize for their different nonfunctional requirements.

  • Sender service: Receives messages from senders and immediately delivers them to recipients. It also records these messages in the message service.
  • Message service: Senders can make requests to this service for their sent messages, while recipients can make requests to this service for both their received and unreceived messages.
  • Connection service: For storage and retrieval of users’ active and blocked connections, add other users to one’s contact list, block other users from sending messages. The connection service also stores connection metadata, such as names, avatars, and public keys.

Users make requests to our services via an API gateway. Our sender service makes requests to our message service to record messages, including messages that failed to be delivered to recipients. It also makes requests to our connection service to check if a recipient has blocked the message sender. We discuss more details in subsequent sections.

High-level architecture with the relationships between our services.

Connection service

The connection service should provide the following endpoints:

  • GET /connection/user/{userId}: GET all of a user’s connections and their metadata, including both active and blocked connections and active connections’ public keys. We may also add additional path or query parameters for filtering by connection groups or other categories.
  • POST /connection/user/{userId}/recipient/{recipientId}: New connection request from a user with userId to another user with recipientId.
  • PUT /connection/user/{userId}/recipient/{recipientId}/request/{accept}: Accept is a Boolean variable to accept or reject a connection request.
  • PUT /connection/user/{userId}/recipient/{recipientId}/block/{block}: Block is a Boolean variable to block or unblock a connection.
  • DELETE /connection/user/{userId}/recipient/{recipientId}: Delete a connection.

Users’ connections (including both active and blocked connections) should be stored on users’ devices (i.e., in their desktop or mobile apps) or in browser cookies or localStorage, so the connection service is a backup for this data in case a user changes devices or to synchronize this data across a user’s multiple devices. We do not expect heavy write traffic or a large amount of data, so we can implement it as a simple stateless backend service that stores data in a shared SQL service.

To reduce traffic to the server, blocked recipient connections should be stored on a user’s device, so the device can prevent the user from interacting with this recipient, and the server does not have to block such undesired interactions. Whether we wish to inform a user that another user has blocked them is a UX design decision that is up to us.

Public keys

When a device installs (or reinstalls) our app and starts our app for the first time, it generates a public-private key pair. It should store its public key in the connection service. The connection service should immediately update the user’s connections with the new public key via their WebSocket connections.

As a user may have up to 1,000 connections, each with five devices, a key change may require up to 5,000 requests, and some of these requests may fail because the recipients may be unavailable. Key changes will likely be rare events, so this should not cause unpredicted traffic surges, and the connection service should not need to use message brokering or Kafka. A connection who didn’t receive the update can receive it in a later GET request.

If a sender encrypts their message with an outdated public key, it will appear as gibberish after the recipient decrypts it. To prevent the recipient device from displaying such errors to the recipient user, the sender can hash the message with a cryptographic hash function such as SHA-2 and include this hash as part of the message. The recipient device can hash the decrypted message and display the decrypted message to the recipient user only if the hashes match. The sender service can provide a special message endpoint for a recipient to request the sender to resend the message. The recipient can include its public key, so the sender will not repeat this error and can also replace its outdated public key with the new one.

One way to prevent such errors is that a public key change should not be effective immediately. The request to change a public key can include a grace period (such as seven days) during which both keys are valid. If a recipient receives a message encrypted with the old key, it can send a special message request to the Sender Service containing the new key, and the sender service requests the sender to update the latter’s key.

Sender service

The sender service is optimized for scalability, availability, and performance of a single function, which is to receive messages from senders and deliver them to recipients in near real time. It should be made as simple as possible to optimize debuggability and maintainability of this critical function. If there are unpredicted traffic surges, it should be able to buffer these messages in a temporary storage, so it can process and deliver them when it has sufficient resources.

Below figure shows the high-level architecture of our sender service. It consists of two services with a Kafka topic between them.

A message has the fields sender ID, a list of up to 1,000 recipient IDs, body string, and message sent status enum (the possible statuses are “message sent,” “message delivered,” and “message read”).

Sending a message occurs as follows. On the client, a user composes a message with a sender ID, recipient IDs, and a body string. Delivery confirmation and read receipt are initialized to false. The client encrypts the body and then sends the message to the sender service.

The new message service receives a message request, produces it to the new message Kafka topic, then returns 200 success to the sender. A message request from one sender may contain up to 5,000 recipients, so it should be processed asynchronously this way. The new message service may also perform simple validations, such as whether the request was properly formatted, and return 400 error to invalid requests (as well as trigger the appropriate alerts to developers).

Message generator consumes from the new message Kafka topic and generates a separate message for each recipient. It may fork a thread or maintain a thread pool to generate a message. The host may also write a checkpoint to a distributed in-memory database such as Redis. If the host fails while generating messages, its replacement can look up this checkpoint, so it doesn’t generate duplicate messages.

The message consumer service consumes from the recipient topic and then does the following steps.

  1. Check if the sender should have been blocked. The message-sending service should store this data, instead of having to make a request to the connection service for every message. If a message has a blocked sender, it indicates that the client-side blocking mechanism has failed, possibly due to bugs or malicious activity. In this case, we should trigger an alert to developers.
  2. Each message-sending service host has WebSocket connections with a number of recipients. We can experiment with this number to determine a good balance. Using a Kafka topic allows each host to serve a larger number of recipients, since it can consume from the Kafka topic only when it is ready to deliver a message. The service can use a distributed configuration service like ZooKeeper to assign hosts to devices. The host assigner service uses a ZooKeeper service to maintain a mapping of device IDs to hosts.
    1. The message-sending service host that is handling the current message can query the host assigner service for the appropriate host and then request that host to deliver the message to the recipient.
    2. In parallel, the message-sending service should also log the message to the message service, which is discussed further in the next section.
  3. The sender service sends the message to the recipient client. If the message cannot be delivered to the recipient client (most likely because the recipient device is off or doesn’t have internet connectivity), we can simply drop the message because it has already been recorded in the message service and can be retrieved by the device later.
  4. The receiver can ensure that the message isn’t a duplicate and then display it to the user. The receiver app can also trigger a notification on the user’s device.
  5. When the user reads the message, the app can send a read receipt message to the sender, which can be delivered in a similar manner.

The message service can have a retention period of a few weeks, after which it deletes the message. When a recipient device comes online, it can query the messaging service for new messages. This request will be directed to its host, which can query the metadata service for the new messages and return them to the recipient device.

The message-sending service also provides an endpoint to update blocked/unblocked senders. The connection service makes requests to the message-sending service to update blocked/unblocked senders. The connection service and message- sending service are separate to allow independent scaling; we expect more traffic on the latter than the former.

Message service

Our message service serves as a log of messages. Users may make requests to it for the following purposes:

  • If a user just logged in to a new device or the device’s app storage was cleared, the device will need to download its past messages (both its sent and received messages).
  • A message may be undeliverable. Possible reasons include being powered off, being disabled by the OS, or no network connectivity to our service. When the client is turned on, it can request the message service for messages that were sent to it while it was unavailable.

For privacy and security, our system should use end-to-end encryption, so messages that pass through our system are encrypted. An additional advantage of end-to-end encryption is that messages are automatically encrypted both in transit and at rest. We can understand end-to-end encryption in three simple steps:

  1. A receiver generates a public-private key pair.
  2. A sender encrypts a message with the receiver’s public key and then sends the receiver the message.
  3. A receiver decrypts a message with their private key.

After the client successfully receives the messages, the message service can have a retention period of a few weeks, after which it deletes the messages to save storage and for better privacy and security.

Search

Each user can only search on their own messages. We may implement search-to-search directly in text messages, and not build a reverse index on each client, avoiding the costs of design, implementation, and maintenance of a reverse index. The storage size of an average client’s messages will probably be far less than 1 GB (excluding media files). It is straightforward to load these messages into memory and search them.

We may search on media file names, but not on the content of the files themselves. Search on byte strings is outside the scope of this book.