Problem

Design a system like Pastebin, where a user can enter a piece of text and get a randomly generated URL to access it.

Solution

Pastebin is a web service that enables users to store plain text over the network and generates unique URLs for accessing the uploaded data. It is also used to share data quickly, since users only need to pass the URL along to let others see it.

Functional Requirements:

  • Users should be able to upload plain text, called “pastes”.
  • Upon uploading a paste, the user should be given a unique URL to access it.
  • Users should only be able to upload text.
  • URLs should expire after a default period of time if the user does not provide an expiration time.

Non-Functional Requirements:

  • The system should be highly reliable, i.e. no data loss.
  • The system should be highly available, i.e. users should be able to access their pastes most of the time.
  • Latency to fetch pastes should be minimal.

State assumptions

  • Traffic is not evenly distributed
  • Following a short link should be fast
  • Page view analytics do not need to be realtime
  • 10 million users
  • 10 million paste writes per month
  • 100 million paste reads per month
  • 10:1 read to write ratio
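
To make the monthly figures easier to reason about, they can be converted into average per-second rates. The sketch below assumes a 30-day month (an assumption, not stated above) and ignores traffic peaks, which matter because traffic is not evenly distributed.

```python
# Average request rates derived from the monthly assumptions above.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

writes_per_month = 10_000_000
reads_per_month = 100_000_000

writes_per_second = writes_per_month / SECONDS_PER_MONTH
reads_per_second = reads_per_month / SECONDS_PER_MONTH
read_write_ratio = reads_per_month // writes_per_month

print(f"average writes per second: {writes_per_second:.1f}")
print(f"average reads per second:  {reads_per_second:.1f}")
print(f"read to write ratio:       {read_write_ratio}:1")
```

On average this works out to roughly 4 writes and 40 reads per second, so the design is read-heavy but modest in raw request volume.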

Storage usage calculation

  • content – 1 KB per paste
  • shortlink – 7 bytes
  • expiration_length_in_minutes – 4 bytes
  • created_at – 5 bytes
  • paste_path – 255 bytes
  • total = ~1.27 KB per paste
  • 12.7 GB of new paste content per month (1.27 KB per paste * 10 million pastes per month)
  • ~450 GB of new paste content in 3 years
  • 360 million shortlinks in 3 years
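
The arithmetic behind these estimates can be reproduced with a short script. This is only a back-of-the-envelope sketch; it assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes), which is what makes the per-paste total come out to about 1.27 KB.

```python
# Per-paste field sizes in bytes, taken from the estimates above.
# Assumption: decimal units, so 1 KB of content is 1,000 bytes.
FIELD_SIZES = {
    "content": 1_000,
    "shortlink": 7,
    "expiration_length_in_minutes": 4,
    "created_at": 5,
    "paste_path": 255,
}
PASTES_PER_MONTH = 10_000_000
MONTHS = 3 * 12

bytes_per_paste = sum(FIELD_SIZES.values())
gb_per_month = bytes_per_paste * PASTES_PER_MONTH / 1e9
gb_in_three_years = gb_per_month * MONTHS
shortlinks_in_three_years = PASTES_PER_MONTH * MONTHS

print(f"bytes per paste:      {bytes_per_paste}")
print(f"GB per month:         {gb_per_month:.1f}")
print(f"GB in 3 years:        {gb_in_three_years:.0f}")
print(f"shortlinks in 3 years: {shortlinks_in_three_years:,}")
```

The three-year figure comes out slightly above 450 GB, which is consistent with the rounded estimate in the list.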

System APIs

These can be implemented as RESTful web services:

  1. createPaste
  2. updatePaste
  3. getPaste
  4. deletePaste
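
One possible way to map these four operations onto REST conventions is sketched below. The HTTP methods follow common REST practice, but the `/pastes` paths, the `Route` type, and the `describe` helper are hypothetical names chosen for illustration and are not part of the design above.

```python
from typing import NamedTuple


class Route(NamedTuple):
    method: str
    path: str


# Hypothetical mapping of the four operations to HTTP methods and paths.
ROUTES = {
    "createPaste": Route("POST", "/pastes"),
    "getPaste": Route("GET", "/pastes/{paste_id}"),
    "updatePaste": Route("PUT", "/pastes/{paste_id}"),
    "deletePaste": Route("DELETE", "/pastes/{paste_id}"),
}


def describe(operation: str, paste_id: str = "") -> str:
    """Return the request line a client would send for an operation."""
    route = ROUTES[operation]
    return f"{route.method} {route.path.format(paste_id=paste_id)}"


print(describe("createPaste"))
print(describe("getPaste", "aB3xY9z"))
```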

Use Cases

There are two immediate use cases that come to mind when we think about a pastebin service:

  1. Add a new paste in the system (and get a unique URL)
  2. Retrieve a paste entry given its URL

Our service will allow a user to choose whether a paste expires after 1 hour, 1 day, 1 week, 2 weeks, 1 month, or never. When an entry expires, its unique URL may be reused for new pastes. Custom URLs are supported, i.e. users may pick a URL of their own choosing, but this is not mandatory. It is reasonable (and often desirable) to impose a size limit on custom URLs so that the URL database stays consistent. Assume that anyone who has the URL is free to visit the paste entry.
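
The expiration choices and the custom URL size limit could be represented as follows. This is a minimal sketch: treating "1 month" as 30 days, the 32-character limit, and the ASCII alphanumeric rule are all assumptions made for illustration.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR

# Expiration choices from the text, in minutes. None means "never expires".
# Assumption: "1 month" is treated as 30 days.
EXPIRATION_MINUTES = {
    "1 hour": MINUTES_PER_HOUR,
    "1 day": MINUTES_PER_DAY,
    "1 week": 7 * MINUTES_PER_DAY,
    "2 weeks": 14 * MINUTES_PER_DAY,
    "1 month": 30 * MINUTES_PER_DAY,
    "never": None,
}

MAX_CUSTOM_URL_LENGTH = 32  # hypothetical size limit for custom URLs


def is_valid_custom_url(custom_url: str) -> bool:
    """Accept non-empty ASCII alphanumeric IDs within the size limit."""
    return (
        0 < len(custom_url) <= MAX_CUSTOM_URL_LENGTH
        and custom_url.isascii()
        and custom_url.isalnum()
    )


print(EXPIRATION_MINUTES["1 week"])
print(is_valid_custom_url("mynotes2024"))
```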

Design

We will need an application service layer that processes incoming requests and forwards outbound requests. The application service layer talks to a backend datastore component. So, clients issue read and write requests to the public application service layer. The flow for the two use cases is described as follows:

  • Add a new paste: The client contacts the application service layer and issues an InsertPost(ID) request. The ID is generated automatically by the application service layer if the user didn’t pick one (recall that users are allowed to pick a custom ID). The application service layer forwards the request to the backend datastore, which checks whether the requested ID is already in the database. If it is, the database is not changed and an error code is returned; otherwise, a new entry is added with the specified ID.
  • Retrieve a paste entry: The application service layer contacts the datastore. The datastore searches for the given ID. If it is found, the post contents are returned; otherwise, an error code is returned.
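
The two flows above can be sketched with an in-memory stand-in for the datastore. All class and method names here are hypothetical, exceptions stand in for the error codes mentioned in the text, and the 7-character random ID is an assumption based on the 7-byte shortlink estimate.

```python
import secrets
import string
from typing import Dict, Optional

ID_ALPHABET = string.ascii_letters + string.digits


class DuplicateIdError(Exception):
    """Stands in for the 'ID already exists' error code."""


class NotFoundError(Exception):
    """Stands in for the 'ID not found' error code."""


class Datastore:
    """In-memory stand-in for the backend datastore."""

    def __init__(self) -> None:
        self._pastes: Dict[str, str] = {}

    def insert(self, paste_id: str, content: str) -> None:
        # Leave the data unchanged and signal an error if the ID is taken.
        if paste_id in self._pastes:
            raise DuplicateIdError(paste_id)
        self._pastes[paste_id] = content

    def get(self, paste_id: str) -> str:
        if paste_id not in self._pastes:
            raise NotFoundError(paste_id)
        return self._pastes[paste_id]


class ApplicationService:
    """Public layer that clients talk to; forwards requests to the datastore."""

    def __init__(self, datastore: Datastore) -> None:
        self._datastore = datastore

    def insert_post(self, content: str, paste_id: Optional[str] = None) -> str:
        # Generate an ID only when the user did not pick a custom one.
        if paste_id is None:
            paste_id = "".join(secrets.choice(ID_ALPHABET) for _ in range(7))
        self._datastore.insert(paste_id, content)
        return paste_id

    def get_post(self, paste_id: str) -> str:
        return self._datastore.get(paste_id)


service = ApplicationService(Datastore())
custom_id = service.insert_post("hello world", paste_id="greeting")
print(service.get_post(custom_id))
```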

Note that in the design above, the application layer is responsible for generating an ID if the user doesn’t provide one. However, the application layer has no internal knowledge of the data in the system, so it may unintentionally generate an ID that already exists and then has to retry with a new one. Alternatively, we could shift this responsibility to the datastore and make the application layer a dumb proxy that just forwards requests to the backend. The main advantage is that we reduce the number of interactions between the application layer and the backend datastore, because collisions are resolved where the data lives.
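
A minimal sketch of the second option, in which the datastore itself generates the ID and resolves collisions locally, might look like this. The class name, the `max_attempts` parameter, and the 7-character alphanumeric ID format are assumptions for illustration.

```python
import secrets
import string
from typing import Dict

ALPHABET = string.ascii_letters + string.digits
ID_LENGTH = 7  # assumption: matches the 7-byte shortlink estimate


class IdGeneratingDatastore:
    """In-memory stand-in for a datastore that assigns IDs itself."""

    def __init__(self, id_length: int = ID_LENGTH) -> None:
        self._id_length = id_length
        self._pastes: Dict[str, str] = {}

    def insert_with_generated_id(self, content: str, max_attempts: int = 5) -> str:
        # Collisions are detected and retried here, so the application
        # layer needs only one round trip per new paste.
        for _ in range(max_attempts):
            candidate = "".join(
                secrets.choice(ALPHABET) for _ in range(self._id_length)
            )
            if candidate not in self._pastes:
                self._pastes[candidate] = content
                return candidate
        raise RuntimeError("could not find a free ID")

    def get(self, paste_id: str) -> str:
        return self._pastes[paste_id]


store = IdGeneratingDatastore()
new_id = store.insert_with_generated_id("some text")
print(len(new_id), store.get(new_id))
```

With 62 possible characters and 7 positions there are 62^7 (about 3.5 trillion) possible IDs, so retries should be rare at the scale estimated earlier.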

To remove expired pastes, one option is to run an automated job every hour that scans the entire dataset, reads the expiration information, and deletes each entry that has expired. On the plus side, it’s simple and it works. But there are downsides: scanning the whole dataset hourly is expensive, and it becomes infeasible once the data grows to many terabytes. Another option is lazy expiration. Expired posts just sit there until some event makes us notice that they have expired. For example, someone makes a request to retrieve a post that has expired. The datastore finds the post, sees that it is past due, purges it, and returns an error message to the user. Or someone tries to create a new post with an ID that is already in use, but when we look up that ID we see that the associated post has expired, so we purge it and add the new post with the same ID.
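
Lazy expiration can be sketched as a check that runs whenever an ID is touched, on both reads and writes. The class below is an in-memory illustration with hypothetical names; the injectable `clock` parameter is an assumption added so the behaviour can be exercised without waiting for real time to pass.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Paste:
    content: str
    expires_at: Optional[float]  # None means the paste never expires


class LazyExpiringStore:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._pastes: Dict[str, Paste] = {}

    def _purge_if_expired(self, paste_id: str) -> None:
        paste = self._pastes.get(paste_id)
        if (
            paste is not None
            and paste.expires_at is not None
            and paste.expires_at <= self._clock()
        ):
            del self._pastes[paste_id]

    def insert(self, paste_id: str, content: str,
               ttl_seconds: Optional[float] = None) -> bool:
        # An expired entry is purged first, so its ID can be reused.
        self._purge_if_expired(paste_id)
        if paste_id in self._pastes:
            return False  # ID still in use by a live paste
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._pastes[paste_id] = Paste(content, expires_at)
        return True

    def get(self, paste_id: str) -> Optional[str]:
        # An expired entry is purged on read and reported as missing.
        self._purge_if_expired(paste_id)
        paste = self._pastes.get(paste_id)
        return None if paste is None else paste.content


# Usage with a fake clock that is advanced manually.
now = [0.0]
lazy_store = LazyExpiringStore(clock=lambda: now[0])
lazy_store.insert("abc", "first", ttl_seconds=3600)
now[0] = 3601.0  # one hour and one second later
print(lazy_store.get("abc"))  # the paste has expired by now
```

The trade-off is that expired pastes nobody touches again keep occupying storage, so in practice this approach may still be combined with an occasional background cleanup.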

At a high level, the system needs an application server that can serve read and write requests. The application server stores paste data on block storage, while all metadata related to pastes and users is stored in a database. Cache servers and load balancers can be added to improve the performance and scalability of the system.