Problem
You have 10 billion URLs. How do you detect the duplicate documents? In this case, assume that “duplicate” means that the URLs are identical.
Approach
If each URL is an average of 100 characters, and each character is 4 bytes, then this list of 10 billion URLs will take up about 4 terabytes. We are probably not going to hold that much data in memory.
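To sanity-check that estimate, here is the arithmetic as a minimal sketch, using only the assumptions stated above (10 billion URLs, 100 characters per URL, 4 bytes per character):

num_urls = 10 * 10**9        # 10 billion URLs
chars_per_url = 100          # assumed average URL length
bytes_per_char = 4           # assumed per-character storage cost

total_bytes = num_urls * chars_per_url * bytes_per_char
print(total_bytes)           # 4000000000000 bytes, i.e. about 4 terabytes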
But let's pretend for a moment that we could miraculously hold this data in memory, since it's useful to first construct a solution for the simple version of the problem. Under this version, we would create a hash table and insert each URL into it; any URL that is already in the table when we try to insert it is a duplicate.
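A minimal sketch of that in-memory approach, assuming the URLs arrive as an iterable of strings (the function name is illustrative):

def find_duplicates(urls):
    """Return the set of URLs that appear more than once."""
    seen = set()         # every URL encountered so far
    duplicates = set()   # URLs seen at least twice
    for url in urls:
        if url in seen:
            duplicates.add(url)
        else:
            seen.add(url)
    return duplicates

For example, find_duplicates(["a.com", "b.com", "a.com"]) returns {"a.com"}.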
Now that we have a solution for the simple version, we can solve the problem for 4,000 gigabytes of data either by storing some of the data on disk or by splitting the data up across machines.
If we stored all the data on one machine, we would do two passes over the list of URLs. The first pass would split the list into 4,000 chunks of 1 GB each. An easy way to do that is to store each URL u in a file named <x>.txt where x = hash(u) % 4000. That is, we divide up the URLs based on their hash value (modulo the number of chunks), so all URLs with the same hash value end up in the same file. In the second pass, we would implement the solution we came up with earlier: load each file into memory, create a hash table of the URLs, and look for duplicates.
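A sketch of that two-pass, single-machine procedure, assuming the URLs can be streamed from a single input file, one per line, and that all 4,000 chunk files can be kept open at once (in practice the open-file limit may force batching; file and function names are illustrative):

import os

NUM_CHUNKS = 4000

def split_into_chunks(input_path, chunk_dir):
    # Pass 1: route each URL to <x>.txt where x = hash(url) % NUM_CHUNKS,
    # so every copy of a given URL ends up in the same ~1 GB chunk file.
    os.makedirs(chunk_dir, exist_ok=True)
    chunks = [open(os.path.join(chunk_dir, f"{x}.txt"), "w") for x in range(NUM_CHUNKS)]
    with open(input_path) as f:
        for line in f:
            url = line.strip()
            chunks[hash(url) % NUM_CHUNKS].write(url + "\n")
    for chunk in chunks:
        chunk.close()

def find_duplicates(chunk_dir):
    # Pass 2: each chunk fits in memory, so dedupe it with a hash set.
    duplicates = set()
    for x in range(NUM_CHUNKS):
        seen = set()
        with open(os.path.join(chunk_dir, f"{x}.txt")) as f:
            for line in f:
                url = line.strip()
                if url in seen:
                    duplicates.add(url)
                else:
                    seen.add(url)
    return duplicates

Note that Python's built-in hash() is randomized per process, which is fine here because all of the routing happens within a single run of split_into_chunks; a version that routes across multiple runs would want a stable hash such as one from hashlib.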
The other solution is to perform essentially the same procedure, but to use multiple machines. In this solution, rather than storing the data in file <x>.txt, we would send the URL to machine x.
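A sketch of that routing step, assuming a hypothetical send_to_machine(machine_id, url) helper supplied by whatever distribution framework is in use:

NUM_MACHINES = 4000

def route_url(url, send_to_machine):
    # Same partitioning idea as the file-based version, but the "chunk" is
    # now a machine: every copy of a given URL is sent to the same machine,
    # which can then run the in-memory hash-table check locally.
    send_to_machine(hash(url) % NUM_MACHINES, url)

Here the hash would again need to be one shared consistently by all senders.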
Using multiple machines, we can parallelize the operation, such that all 4,000 chunks are processed simultaneously. For large amounts of data, this might result in a faster solution. The disadvantage, though, is that we are now relying on 4,000 different machines to operate without failure. That may not be realistic (particularly with more data and more machines), and we will need to start considering how to handle failure.