Problem 2 – Massive I/O Caused by Spilled Records in Partition and Sort phases

When the map function starts producing output, it is not simply written to disk. Each map task writes its output to a circular memory buffer, which is 100 MB by default. When the contents of the buffer reach a certain threshold, a background thread starts to spill the contents to disk. Map output continues to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map task blocks until the spill is complete. Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory.


To optimize Map task output, we need to ensure that records are spilled (that is, written to the local filesystem) only once. When records are spilled more than once, the same data must be read in and written out multiple times, adding unnecessary I/O. If the sort buffer is too small and fills up too quickly, the map task performs multiple spills, and each extra spill generates a large volume of intermediate bytes.

We can identify this issue from high values in the following job counters: FILE_BYTES_READ, FILE_BYTES_WRITTEN, and Spilled Records. In particular, when Spilled Records exceeds Map output records, some records are being spilled to disk more than once.
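As an illustration, the sketch below reads these counters from a completed job through the standard org.apache.hadoop.mapreduce API. It is only a sketch: the class name is hypothetical, and the "file" scheme used to look up the local filesystem counters should be verified against your Hadoop version.

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.FileSystemCounter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class SpillCheck {
        // Prints the spill-related counters of a completed job and flags multiple spills.
        public static void report(Job job) throws IOException {
            Counters counters = job.getCounters();
            long spilled   = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
            long mapOutput = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
            // "file" is the local filesystem scheme; these correspond to the
            // FILE_BYTES_READ / FILE_BYTES_WRITTEN entries in the job counter listing.
            long fileRead    = counters.findCounter("file", FileSystemCounter.BYTES_READ).getValue();
            long fileWritten = counters.findCounter("file", FileSystemCounter.BYTES_WRITTEN).getValue();

            System.out.printf("Spilled Records:    %d%n", spilled);
            System.out.printf("Map output records: %d%n", mapOutput);
            System.out.printf("FILE_BYTES_READ:    %d%n", fileRead);
            System.out.printf("FILE_BYTES_WRITTEN: %d%n", fileWritten);
            if (spilled > mapOutput) {
                System.out.println("Records are spilled more than once; consider enlarging the sort buffer (see Solution 2).");
            }
        }
    }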


Solution 2: Adjust Spill Records and Sorting Buffer
To reduce the amount of data spilled during the intermediate Map phase, we can adjust the following properties, which control sorting and spilling behavior.


mapreduce.task.io.sort.factor (default: 10)
The number of streams to merge at once while sorting files. This determines the number of open file handles.

mapreduce.task.io.sort.mb (default: 100)
The total amount of buffer memory to use while sorting files, in megabytes. By default, this gives each merge stream 1 MB, which should minimize seeks.

mapreduce.map.sort.spill.percent (default: 0.80)
The soft limit on the serialization buffer. Once it is reached, a background thread begins to spill the contents to disk. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than the threshold when it is set to less than 0.5.
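As a sketch of how these properties can be set per job, the driver below tunes them through the Configuration object. The class name and the values are illustrative, not recommendations; the buffer size should be derived with the formula discussed next.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTunedDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative values only; size the buffer using the io.sort.mb formula below.
            conf.setInt("mapreduce.task.io.sort.mb", 200);           // total sort buffer, in MB
            conf.setFloat("mapreduce.map.sort.spill.percent", 1.0f); // spill only when the buffer is full
            conf.setInt("mapreduce.task.io.sort.factor", 64);        // streams merged at once
            Job job = Job.getInstance(conf, "spill-tuned-job");
            // ... set mapper, reducer, input and output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same properties can also be set cluster-wide in mapred-site.xml, or per submission with -D options when the driver uses ToolRunner.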

When Map output is being sorted, 16 bytes of metadata are added immediately before each key-value pair: 12 bytes for the key-value offsets and 4 bytes for the indirect-sort index. Therefore, the total buffer space defined by io.sort.mb can be divided into two parts: the metadata buffer and the key-value buffer.
Formula for io.sort.mb (the result is in megabytes; 1,048,576 is the number of bytes in a megabyte):

io.sort.mb = (16 + R) * N / 1,048,576

R – the average length of a key-value pair in bytes, calculated by dividing Map output bytes by Map output records (both taken from the job counters).

N – the average number of output records per map task, calculated by dividing Map output records by the number of map tasks.
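As a quick worked example with made-up numbers: if a job reports 20,000,000,000 Map output bytes and 200,000,000 Map output records across 200 map tasks, then R = 100 bytes, N = 1,000,000 records per task, and io.sort.mb = (16 + 100) * 1,000,000 / 1,048,576 ≈ 111 MB, which you might round up to 128 MB. The small helper below (hypothetical, not part of any Hadoop API) applies the same formula to counter values:

    // Hypothetical helper: estimate io.sort.mb from job counter values.
    static int estimateSortMb(long mapOutputBytes, long mapOutputRecords, int numMapTasks) {
        double r = (double) mapOutputBytes / mapOutputRecords; // average key-value pair length in bytes
        double n = (double) mapOutputRecords / numMapTasks;    // average output records per map task
        double mb = (16 + r) * n / 1_048_576;                  // 16 bytes of metadata per record, result in MB
        return (int) Math.ceil(mb);                            // round up to whole megabytes
    }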

Also update mapreduce.map.sort.spill.percent to 1.0 so that the entire buffer space is used before a spill is triggered.