Problem
A large eCommerce company wishes to list the best-selling products, overall and by category. For example, one product might be the #1056th best-selling product overall but the #13th best-selling product under Sports Equipment and the #24th best-selling product under Safety. Describe how you would design this system.
Solution
Scope the Problem
First, we need to know whether there is any limit on how many best-selling products we want to list. This is important because it tells us what kind of scale we're dealing with. Since the problem mentions an example with the #1056th best-selling product, we will assume we want to rank everything.
So this question is really about building an analytics system that ranks products by number of sales, both overall and per category.
Make Reasonable Assumptions
Let’s say there are a few million users (say 10 million), and the average user makes one purchase per week. That gives us about 10 million purchases per week, or roughly 17 purchases per second. Assume there are about 30 categories. Let’s also assume this eCommerce platform carries quite a lot of products: about 20 million. Assuming a uniform distribution (which is probably not accurate in practice), that places roughly 667,000 products in each category.
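A quick back-of-the-envelope check of these figures (the constants below are just the assumptions stated above, not real platform data):

```python
# Back-of-envelope check of the stated assumptions.
USERS = 10_000_000                 # assumed active users
PURCHASES_PER_USER_PER_WEEK = 1    # assumed purchase rate
CATEGORIES = 30
PRODUCTS = 20_000_000

purchases_per_week = USERS * PURCHASES_PER_USER_PER_WEEK
purchases_per_second = purchases_per_week / (7 * 24 * 3600)
products_per_category = PRODUCTS / CATEGORIES

print(f"purchases/second:  ~{purchases_per_second:.1f}")    # ~16.5
print(f"products/category: ~{products_per_category:,.0f}")  # ~666,667
```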
So we want to design a system that keeps a globally sorted view of the number of sales of all 20 million products, as well as a sorted view for each category.
Finally, what is the timespan of the ranking? Is it a rank over all sales ever made, or should we limit it to, say, last week's sales? We will assume that the rank data should include only sales that took place within the last week.
Design
Should we update our ranking metrics instantly when a purchase takes place, or is it acceptable to fetch updated data periodically? Analytics and ranking systems usually don't need to be 100% exact; in fact, insisting on exactness isn't advisable, since the cost of doing so would be prohibitively high without any immediate benefit.
So we will update the rank data periodically. In this design, we add another component to the architecture that stores purchase entries that haven't been processed yet: a logging module that accumulates this data. Periodically (say, every 30 minutes) the ranking module queries the logging module for the latest updates, merges them in, re-sorts everything, and updates the stats accordingly. This is the better choice here: as long as the logging module can handle the overall purchases-per-second load, the ranking system scales naturally, because even if the number of purchases per second grows, the rate at which we perform sorting remains constant (we simply receive more data per interval).
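A rough sketch of that periodic update loop (`drain`, `merge`, `purge_older_than_days`, and `rebuild_sorted_views` are hypothetical method names standing in for whatever interfaces the two modules expose):

```python
import time

PULL_INTERVAL_SECONDS = 30 * 60  # how often the ranking module pulls new data

def ranking_update_loop(logging_module, ranking_module):
    """Periodically pull unprocessed purchase counts and rebuild the rankings."""
    while True:
        time.sleep(PULL_INTERVAL_SECONDS)
        deltas = logging_module.drain()          # fetch and clear pending counts
        ranking_module.merge(deltas)             # add counts to the weekly totals
        ranking_module.purge_older_than_days(7)  # drop sales outside the window
        ranking_module.rebuild_sorted_views()    # re-sort globally and per category
```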
The logging system only needs to keep track of how many times each product in each category was bought, so a set of hash tables, one per category, will do. Here's the typical flow (sketched in code after the list):
- A new purchase occurs
- The new purchase information (the product ID and its category) is sent to the logging system
- The logging system increments the count of purchases seen for that product
- When new information is fetched by the ranking system, the logging system is cleared, because we won’t fetch it again.
- The ranking system, in turn, merges this new information into its database and re-sorts everything. It also purges data that is older than one week.
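A minimal in-memory sketch of this logging module (the class and method names below, such as `LoggingModule`, `record_purchase`, and `drain`, are illustrative, not from any specific framework):

```python
from collections import defaultdict
from threading import Lock

class LoggingModule:
    """Accumulates purchase counts per category until the ranking module drains them."""

    def __init__(self):
        # One hash table per category: category_id -> {product_id: count}
        self._counts = defaultdict(lambda: defaultdict(int))
        self._lock = Lock()

    def record_purchase(self, category_id: int, product_id: int) -> None:
        """Called once per purchase; just increments a counter."""
        with self._lock:
            self._counts[category_id][product_id] += 1

    def drain(self) -> dict:
        """Return all pending counts and clear them, so nothing is fetched twice."""
        with self._lock:
            pending = {cat: dict(products) for cat, products in self._counts.items()}
            self._counts.clear()
            return pending
```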
Let’s assume a product ID is 8 bytes long and the associated count is 4 bytes, so 12 bytes per entry. In the worst case, there are 20 million entries on hold in the logging system, which is about 240 MB. A simple hash table will do, and it is relatively scalable: if the hash table grew as large as 1 GB, that would mean roughly 83 million products on hold, so we have a comfortable growth margin. Remember that the hash table is cleared whenever the ranking system fetches new information, which happens every 30 minutes (a value we could tune over time).
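Spelling out the sizing math:

```python
ENTRY_BYTES = 8 + 4           # product ID (8 B) + count (4 B)
MAX_ENTRIES = 20_000_000      # worst case: every product sold at least once per interval

worst_case_bytes = MAX_ENTRIES * ENTRY_BYTES
print(worst_case_bytes / 1e6)  # ~240 MB

# Growth margin: how many entries would fit in 1 GB?
print(1e9 / ENTRY_BYTES)       # ~83 million products
```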
Now for the ranking system. First, it needs to keep the count of purchases for each known product over the last week; a large list of (product ID, count) pairs is enough. Once we sort the global list and each category's list, we write the results into a set of files and make them available to the frontend components.
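A simplified sketch of that last step, assuming each product belongs to exactly one category and that plain JSON files are an acceptable hand-off format (file names and layout are illustrative):

```python
import json
import os
from collections import defaultdict

def rebuild_rankings(weekly_counts: dict, out_dir: str = "rankings") -> None:
    """weekly_counts maps category_id -> {product_id: sales_in_last_week}.
    Writes one sorted ranking file per category plus a global one for the frontend."""
    os.makedirs(out_dir, exist_ok=True)
    global_counts = defaultdict(int)

    for category_id, products in weekly_counts.items():
        # Rank this category by sales, descending.
        ranked = sorted(products.items(), key=lambda kv: kv[1], reverse=True)
        with open(os.path.join(out_dir, f"category_{category_id}.json"), "w") as f:
            json.dump(ranked, f)
        # Under the one-category-per-product assumption, the global count
        # is just the sum of the per-category counts.
        for product_id, count in products.items():
            global_counts[product_id] += count

    global_ranked = sorted(global_counts.items(), key=lambda kv: kv[1], reverse=True)
    with open(os.path.join(out_dir, "global.json"), "w") as f:
        json.dump(global_ranked, f)
```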