Suppose we have a very large text file, say 9 GB, that we need to sort and store. But the available main memory (RAM) is limited, say 1 GB.
A programmer views a random-access file stored on disk as a contiguous series of bytes, i.e. consecutive logical addresses. The physical file on disk, however, is usually not a contiguous series of bytes; it may be in pieces spread all over the disk. The operating system's file manager is responsible for retrieving data from (or saving data to) a logical file and mapping those requests to the physical location of the data on disk.
Since the file (9 GB) is too large to fit into main memory (1 GB), it must reside on disk (or arrive as a stream) and be brought into RAM as needed for processing. The idea is to break the big file into smaller chunks that each fit into RAM, sort each chunk, and then merge the sorted chunks to sort the whole file.
One way to sort a big file is external merge sort. External merge sort sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, to sort 9 GB of data using only 1 GB of RAM:
- Read 1 GB of the data into main memory and sort it with a conventional in-memory sort (e.g. merge sort). Then write the sorted chunk to disk.
- Repeat step 1 until all of the data has been sorted into chunks (there are 9 GB / 1 GB = 9 chunks).
- Read the first 100 MB of each sorted 1 GB chunk into an input buffer in main memory, and allocate the remaining 100 MB as an output buffer. In total there are 10 buffers (9 input buffers for reading sorted data and 1 output buffer for writing merged data).
- Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. During merging, a chunk need not be loaded completely; sequential parts of it can be loaded as needed.
- Whenever any of the 9 input buffers empties, refill it with the next 100 MB of its associated 1 GB sorted chunk, until no more data from that chunk remains.
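The steps above can be sketched in Python. This is a minimal, line-oriented illustration, not a production implementation: `chunk_lines` stands in for "as many records as fit in 1 GB of RAM", and `heapq.merge` performs the k-way merge lazily, pulling one line at a time from each sorted chunk file (the OS's read-ahead and file buffering play the role of the 100 MB input/output buffers). The function and file names are illustrative, not from the original text.

```python
import heapq
import os

def sort_chunks(input_path, chunk_lines, tmp_dir):
    """Phase 1: split the big file into runs of at most chunk_lines
    lines, sort each run in memory, and write it to its own file."""
    chunk_paths = []
    with open(input_path) as f:
        while True:
            # Read the next "RAM-sized" batch of lines (step 1).
            lines = [line for _, line in zip(range(chunk_lines), f)]
            if not lines:
                break  # all data has been turned into sorted chunks (step 2)
            lines.sort()  # in-memory sort of one chunk
            path = os.path.join(tmp_dir, f"chunk_{len(chunk_paths)}.txt")
            with open(path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(path)
    return chunk_paths

def merge_chunks(chunk_paths, output_path):
    """Phase 2: k-way merge of the sorted chunk files (steps 3-5)."""
    files = [open(p) for p in chunk_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge streams the smallest remaining line across
            # all chunks, so no chunk is ever fully loaded into RAM.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
```

For example, sorting a 7-line file with `chunk_lines=3` produces 3 sorted chunk files, which `merge_chunks` then combines into one fully sorted output file.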