Craigslist is an example of a typical web application that may have more than a billion users. It is partitioned by geography.

User stories

We distinguish two primary user types: viewer and poster.

A poster should be able to create and delete a post and search their own posts, as they may have many. A post should contain the following information:

  • Title.
  • Some paragraphs of description.
  • Price. Assume a single currency and ignore currency conversions.
  • Location.
  • Up to 10 photos of 1 MB each.

A poster can renew their post every seven days. They will receive an email notification with a click-through to renew their post.

A viewer should be able to

  1. View or search all posts made in the last seven days within any city. View a list of results, possibly as an endless scroll.
  2. Apply filters on the results.
  3. Click an individual post to view its details.
  4. Contact the poster, such as by email.
  5. Report fraud and misleading posts (e.g., a possible clickbait technique is to state a low price on the post but a higher price in the description).

Requirements

The non-functional requirements are as follows:

  • Scalable—Up to 10 million users in a single city.
  • High availability—99.9% uptime.
  • High performance—Viewers should be able to view posts within seconds of creation.
  • Security—A poster should log in before creating a post. We can use an authentication library or service.

Most of the required storage will be for Craigslist posts. The amount of required storage is low:

  • We may show a Craigslist user only the posts in their local area. This means that a data center serving any individual user only needs to store a fraction of all the posts (though it may also back up posts from other data centers).
  • Posts are manually (not programmatically) created, so storage growth will be slow.
  • A post may be automatically deleted after one week.

The low storage requirement means that all the data can fit on a single host, so we do not require distributed storage solutions. Let’s assume an average post contains 1,000 letters, or 1 KB of text. If we assume that a big city has 10 million people and 10% of them are posters creating an average of 10 posts/day each (i.e., 10 million posts, or 10 GB/day), our SQL database can easily store months of posts.
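The arithmetic above can be checked with a quick back-of-the-envelope calculation. The figures below are the assumptions stated in the text; note that this counts only the text stored in SQL, while photos go to the object store.

```python
# Back-of-the-envelope storage estimate for one big city (text only).
city_population = 10_000_000           # assumed size of a big city
poster_fraction = 0.10                 # assume 10% of users are posters
posts_per_poster_per_day = 10          # assumed posting rate per poster
post_size_bytes = 1_000                # ~1,000 letters ≈ 1 KB of text

posts_per_day = int(city_population * poster_fraction * posts_per_poster_per_day)
bytes_per_day = posts_per_day * post_size_bytes

print(posts_per_day)                   # 10,000,000 posts/day
print(bytes_per_day / 1e9)             # 10 GB/day of text
```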

API Design

Let’s scribble down some API endpoints, separated into managing posts and managing users.

CRUD (Create, read, update and delete) posts:

  • GET and DELETE /post/{id}
  • GET /post?search={search_string}. This can be an endpoint to GET all posts, with a “search” query parameter to search on posts’ content. We may also implement query parameters for pagination.
  • POST and PUT /post
  • POST /contact
  • POST /report
  • DELETE /old_posts
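The post endpoints above can be sketched as handler functions. This is a minimal in-memory sketch; the `posts` dict stands in for the SQL database, and all names are illustrative, not part of any real API.

```python
import itertools

# In-memory stand-in for the SQL database; illustrative only.
posts = {}
_next_id = itertools.count(1)

def create_post(title, description, price, location):
    """POST /post: create a post and return its new ID."""
    post_id = next(_next_id)
    posts[post_id] = {"title": title, "description": description,
                      "price": price, "location": location}
    return post_id

def get_post(post_id):
    """GET /post/{id}: return the post, or None if it does not exist."""
    return posts.get(post_id)

def delete_post(post_id):
    """DELETE /post/{id}: a single simple delete operation."""
    posts.pop(post_id, None)

def search_posts(search_string):
    """GET /post?search={search_string}: naive substring search on content."""
    return [p for p in posts.values()
            if search_string.lower() in (p["title"] + " " + p["description"]).lower()]
```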

User management:

  • POST /signup. We do not need to discuss user account management.
  • POST /login
  • DELETE /user

SQL database schema

We can design the following SQL schema for our Craigslist user and post data.

  • User: id PRIMARY KEY, first_name text, last_name text, signup_ts integer
  • Post: This table is denormalized, so JOIN queries are not required to get all the details of a post. id PRIMARY KEY, created_at integer, poster_id integer, location_id integer, title text, description text, price integer, condition text, country_code char(2), state text, city text, street_number integer, street_name text, zip_code text, phone_number integer, email text
  • Images: id PRIMARY KEY, ts integer, post_id integer, image_address text
  • Report: id PRIMARY KEY, ts integer, post_id integer, user_id integer, abuse_type text, message text
  • Storing images: We can store images on an object store.
  • image_address: The identifier used to retrieve an image from the object store.
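The schema above can be expressed as DDL. The sketch below creates the tables in an in-memory SQLite database; column types are adapted to SQLite’s affinities, and phone_number is stored as text here since phone numbers can contain leading zeros and “+”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    signup_ts INTEGER
);
-- Denormalized: every detail needed to render a post, no JOINs required.
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    created_at INTEGER,
    poster_id INTEGER,
    location_id INTEGER,
    title TEXT,
    description TEXT,
    price INTEGER,
    condition TEXT,
    country_code CHAR(2),
    state TEXT,
    city TEXT,
    street_number INTEGER,
    street_name TEXT,
    zip_code TEXT,
    phone_number TEXT,   -- text, not integer: leading zeros and '+'
    email TEXT
);
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    ts INTEGER,
    post_id INTEGER,
    image_address TEXT   -- key used to retrieve the image from the object store
);
CREATE TABLE report (
    id INTEGER PRIMARY KEY,
    ts INTEGER,
    post_id INTEGER,
    user_id INTEGER,
    abuse_type TEXT,
    message TEXT
);
""")
```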

When low latency is required, such as when responding to user queries, we usually use SQL or low-latency in-memory databases such as Redis. NoSQL databases that use distributed file systems such as HDFS are for large data-processing jobs.

Writing and Reading posts

The backend writes the post to the SQL database and returns the post ID. To ensure that the entire post is uploaded successfully, the backend should upload the images to the object store itself and return 200 success to the client only after all image files have been successfully uploaded. Image file uploads can fail for reasons such as the backend host crashing during the upload, network connectivity problems, or the object store being unavailable.

Sequence diagram of writing a new post. Backend handles image uploads
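The write flow can be sketched as follows. `db` and `object_store` are illustrative stand-ins for the SQL database and object store clients, as is the object-store key scheme; this is a sketch of the success/failure contract, not a definitive implementation.

```python
def create_post_with_images(db, object_store, post, images):
    """Write the post row first, then upload every image.

    Return (200, post_id) only after all uploads succeed, so the client
    knows the entire post was stored; any upload failure is surfaced so
    the client can retry.
    """
    post_id = db.insert_post(post)             # returns the new post ID
    try:
        for i, image_bytes in enumerate(images):
            address = f"{post_id}/{i}.jpg"     # object-store key; illustrative
            object_store.put(address, image_bytes)
            db.insert_image(post_id, address)
    except Exception:
        # An upload failed (e.g., network problem or object store down):
        # report the error so the post can be cleaned up or retried.
        return (500, post_id)
    return (200, post_id)
```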

The sequence diagram of a viewer reading a post is similar to the above, except with GET instead of POST requests. When a viewer reads a post, the backend fetches the post from the SQL database and returns it to the client. Next, the client fetches and displays the post’s images from the object store. Because the image files are stored on different storage hosts and are replicated, the fetch requests can be made in parallel, downloading from separate storage hosts simultaneously.
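The parallel fetch can be sketched with a thread pool. `object_store.get` is an illustrative stand-in for an HTTP GET against the object store or CDN.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_images(object_store, addresses):
    """Fetch a post's images concurrently.

    The downloads are independent of each other (the files live on
    different, replicated storage hosts), so they can run in parallel.
    Results are returned in the same order as `addresses`.
    """
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(object_store.get, addresses))
```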

Functional partitioning

The first step in scaling up can be to employ functional partitioning by geographical region, such as by city. This is commonly referred to as geolocation routing: serving traffic based on the geographic location that DNS queries originate from, which usually corresponds to the location of our users. We can deploy our application into multiple data centers and route each user to the data center that serves their city, which is also usually the closest data center. The SQL cluster in each data center then contains only the data of the cities that it serves.

Craigslist does this geographical partitioning by assigning a subdomain to each city (e.g., sfbay.craigslist.org, shanghai.craigslist.org, etc.). If we go to craigslist.org in our browser, the following steps occur:

  • Our internet service provider does a DNS lookup for craigslist.org and returns its IP address.
  • Our browser makes a request to the IP address of craigslist.org. The server determines our location from the request’s source IP address and returns a 3xx response that redirects us to the subdomain corresponding to our location.
  • Another DNS lookup is required to obtain the IP address of this subdomain.
  • Our browser makes a request to the IP address of the subdomain. The server returns the webpage and data of that subdomain.
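The redirect step can be sketched as follows. The region-to-subdomain mapping and the `redirect_for` function are illustrative; the region string would come from a GeoIP lookup on the client's IP address, which is not shown.

```python
# Illustrative mapping from a coarse region (derived from the client's
# source IP by a GeoIP lookup) to the city subdomain that serves it.
REGION_TO_SUBDOMAIN = {
    "us-ca-sf": "sfbay.craigslist.org",
    "cn-shanghai": "shanghai.craigslist.org",
}

def redirect_for(region):
    """Return the response craigslist.org would send: a 3xx redirect to
    the city subdomain, or a 200 (e.g., a city picker) if the region is
    unknown."""
    subdomain = REGION_TO_SUBDOMAIN.get(region)
    if subdomain is None:
        return (200, None)
    return (302, f"https://{subdomain}/")
```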

It is unlikely that we will need to go beyond functional partitioning and caching. If we do need to scale reads, we can do SQL replication.

Email service

Our backend can send requests to a shared email service for sending email. To send a renewal reminder to posters when a post is seven days old, we can implement a batch ETL job that queries our SQL database for posts that are seven days old and then requests the email service to send an email for each post. Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse; it uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML).
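The daily batch job can be sketched as follows. `db.query` and `email_service.send` are illustrative stand-ins for the SQL client and the shared email service, and the one-day window assumes the job runs once per day.

```python
import time

SECONDS_PER_DAY = 86_400

def send_renewal_reminders(db, email_service, now=None):
    """Daily batch job: find posts that turned seven days old since the
    last run and ask the shared email service to send each poster a
    renewal reminder with a click-through to renew."""
    now = now if now is not None else time.time()
    newest = now - 7 * SECONDS_PER_DAY       # created exactly 7 days ago...
    oldest = newest - SECONDS_PER_DAY        # ...within the last daily window
    rows = db.query(
        "SELECT id, email FROM post WHERE created_at > ? AND created_at <= ?",
        (oldest, newest),
    )
    for post_id, email in rows:
        email_service.send(to=email, template="renew_post", post_id=post_id)
    return len(rows)
```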

The notification service for other apps may have requirements such as handling unpredictable traffic spikes and delivering notifications with low latency, within a short time of being triggered.

Search

We can create an Elasticsearch index on the Post table so users can search posts. We can discuss whether to allow the user to filter the posts before and after searching, such as by user, price, condition, location, or recency, and make the appropriate modifications to our index.
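A search request against such an index might combine full-text search with filters. The sketch below builds an Elasticsearch query body; the field names follow the Post table, but the index mapping itself is an assumption.

```python
SECONDS_PER_DAY = 86_400

def build_search_query(search_string, now_ts, max_price=None, city=None, days=7):
    """Build an Elasticsearch query body: full-text match on title and
    description, filtered to the last `days` days, with optional price
    and city filters."""
    filters = [{"range": {"created_at": {"gte": now_ts - days * SECONDS_PER_DAY}}}]
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    if city is not None:
        filters.append({"term": {"city": city}})
    return {
        "query": {
            "bool": {
                "must": {"multi_match": {
                    "query": search_string,
                    "fields": ["title", "description"],
                }},
                "filter": filters,
            }
        }
    }
```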

Removing old posts

Craigslist posts expire after a certain number of days, upon which the post is no longer accessible. This can be implemented with a cron job or Airflow, calling the DELETE /old_posts endpoint daily. DELETE /old_posts may be its own endpoint separate from DELETE /post/{id} because the latter is a single simple database delete operation, while the former contains more complex logic to first compute the appropriate timestamp value then delete posts older than this timestamp value. Both endpoints may also need to delete the appropriate keys from the Redis cache.
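The logic behind DELETE /old_posts can be sketched as follows. `db` and `cache` are illustrative stand-ins for the SQL client and Redis client, and the `post:{id}` cache key scheme is an assumption.

```python
import time

SECONDS_PER_DAY = 86_400

def delete_old_posts(db, cache, days=7, now=None):
    """Compute the cutoff timestamp, delete posts older than it, and
    evict the deleted posts' keys from the cache."""
    now = now if now is not None else time.time()
    cutoff = now - days * SECONDS_PER_DAY
    old_ids = [row[0] for row in db.query(
        "SELECT id FROM post WHERE created_at < ?", (cutoff,))]
    db.execute("DELETE FROM post WHERE created_at < ?", (cutoff,))
    for post_id in old_ids:
        cache.delete(f"post:{post_id}")   # illustrative cache key scheme
    return old_ids
```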

Monitoring and alerting

Besides what was discussed in section 2.5, we should monitor and send alerts for the following:

  • Our database monitoring solution should trigger a low-urgency alert if old posts were not removed.
  • Anomaly detection for:
    • Number of posts added or removed.
    • High number of searches for a particular term.
    • Number of posts flagged as inappropriate.
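A deliberately simple detector for counts like these is a z-score against recent daily history; this sketch is illustrative, and a production system would use something more robust (seasonality, trends, etc.).

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's count (e.g., posts added/removed, searches for a
    term, or fraud reports) if it lies more than `threshold` standard
    deviations from the recent daily history."""
    if len(history) < 2:
        return False          # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu    # flat history: any change is anomalous
    return abs(today - mu) / sigma > threshold
```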

Summary of architecture discussion so far

The figure below shows our Craigslist architecture with many of the services we have discussed, namely the client, backend, SQL, cache, notification service, search service, object store, CDN, logging, monitoring, alerting, and batch ETL.

Craigslist architecture