Massive I/O Caused by Large Input Data in Map Input Stage

Problem 1 – Massive I/O Caused by Large Input Data in Map Input Stage
This problem happens most often in jobs with light computation and large volumes of source data. If disk I/O cannot keep up, the compute resources sit idle for most of the job, waiting for incoming data, so overall performance is constrained by disk I/O rather than by CPU.

We can identify this issue by unusually high values in the job counters below.

  • Job counters: Bytes Read, HDFS_BYTES_READ
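
To read these counters programmatically (instead of through the web UI), the finished job can be looked up with the MapReduce client API. The snippet below is a minimal sketch, assuming the job id is passed as the first command-line argument and the cluster configuration files are on the classpath; the class name CheckInputIO is just for illustration.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Cluster;
  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.FileSystemCounter;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.JobID;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter;

  public class CheckInputIO {
      public static void main(String[] args) throws Exception {
          // args[0]: the job id, e.g. job_1700000000000_0001 (hypothetical)
          Configuration conf = new Configuration();
          Cluster cluster = new Cluster(conf);
          Job job = cluster.getJob(JobID.forName(args[0]));
          Counters counters = job.getCounters();

          // "Bytes Read" as reported by the file input format
          long bytesRead =
                  counters.findCounter(FileInputFormatCounter.BYTES_READ).getValue();
          // HDFS_BYTES_READ lives in the file-system counter group for the HDFS scheme
          long hdfsBytesRead =
                  counters.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();

          System.out.println("Bytes Read      = " + bytesRead);
          System.out.println("HDFS_BYTES_READ = " + hdfsBytesRead);
      }
  }

If these values are close to the total size of the input data set while the map tasks show low CPU utilization, the job is a good candidate for the compression approach described next.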

Solution 1: Compress Input Data

Compressing the input files saves storage space on HDFS and also speeds up the transfer of data to the map tasks.

We can use any of the compression formats below on the input data sets; a sketch of compressing a file with one of these codecs follows the table.


Format    Codec                                         Extension   Splittable         Native library
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec    .deflate    N                  Y
Gzip      org.apache.hadoop.io.compress.GzipCodec       .gz         N                  Y
Bzip2     org.apache.hadoop.io.compress.BZip2Codec      .bz2        Y                  Y
LZO       com.hadoop.compression.lzo.LzopCodec          .lzo        N (Y if indexed)   Y
LZ4       org.apache.hadoop.io.compress.Lz4Codec        .lz4        N                  Y
Snappy    org.apache.hadoop.io.compress.SnappyCodec     .snappy     N                  Y
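
As a concrete illustration, the sketch below compresses an existing HDFS file with one of the codecs from the table, naming the output with the codec's default extension so that MapReduce can later detect it automatically. The class name CompressInput and the command-line arguments (codec class name, then input path) are assumptions for the example, not a standard tool, and depending on the Hadoop version some codecs (for example LZO or Snappy) may also require the corresponding native library on the client.

  import java.io.InputStream;
  import java.io.OutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class CompressInput {
      public static void main(String[] args) throws Exception {
          // args[0]: codec class name from the table, args[1]: uncompressed HDFS path
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          Class<? extends CompressionCodec> codecClass =
                  Class.forName(args[0]).asSubclass(CompressionCodec.class);
          CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);

          Path in = new Path(args[1]);
          // Keep the codec's default extension (e.g. ".bz2") on the output file
          Path out = new Path(args[1] + codec.getDefaultExtension());

          try (InputStream is = fs.open(in);
               OutputStream os = codec.createOutputStream(fs.create(out))) {
              IOUtils.copyBytes(is, os, conf);
          }
      }
  }

For example, passing org.apache.hadoop.io.compress.BZip2Codec and /data/logs/part-0000 (a hypothetical path) would produce /data/logs/part-0000.bz2. Note that bzip2 is the only format in the table whose compressed output is itself splittable, which matters when individual input files are larger than one HDFS block.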

When we submit a MapReduce job against compressed data in HDFS, Hadoop determines whether an input file is compressed by inspecting its file name extension. If the extension matches a registered codec, Hadoop decompresses the file automatically with that codec, so we do not need to specify a codec explicitly in the MapReduce job.

However, if the file name extension does not match any codec, Hadoop will not recognize the format and will not decompress the file automatically. To keep this self-detection working, make sure each input file carries the default extension of the codec it was compressed with (see the Extension column in the table above).
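
One way to verify this before launching a job is to ask CompressionCodecFactory which codec, if any, matches a given path; this is the same extension-based lookup that the text input format's record reader performs. The sketch below assumes the path is passed as the first argument; the class name CheckExtension is illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;

  public class CheckExtension {
      public static void main(String[] args) {
          // args[0]: an input path, e.g. /data/logs/part-0000.gz (hypothetical)
          Configuration conf = new Configuration();
          CompressionCodecFactory factory = new CompressionCodecFactory(conf);

          // Match a codec against the file name extension
          CompressionCodec codec = factory.getCodec(new Path(args[0]));
          if (codec == null) {
              System.out.println("No codec matched; the file would be read as uncompressed data.");
          } else {
              System.out.println("Would be decompressed with " + codec.getClass().getName());
          }
      }
  }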