Massive I/O Caused by Large Input Data in Map Input Stage

Problem 1 – Massive I/O Caused by Large Input Data in Map Input Stage
This problem happens most often in jobs with light computation and large volumes of source data. If disk I/O cannot keep up, the compute resources sit idle for most of the job, waiting for incoming data, so overall performance is constrained by disk I/O rather than by CPU.

We can identify this issue by unusually high values in the job counters below.

  • Job counters: Bytes Read, HDFS_BYTES_READ
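
To read these counters programmatically (instead of through the web UI), the finished job can be looked up with the MapReduce client API. The snippet below is a minimal sketch, assuming the job id is passed as the first command-line argument and the cluster configuration files are on the classpath; the class name CheckInputIO is just for illustration.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Cluster;
  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.FileSystemCounter;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.JobID;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter;

  public class CheckInputIO {
      public static void main(String[] args) throws Exception {
          // args[0]: the job id, e.g. job_1700000000000_0001 (hypothetical)
          Configuration conf = new Configuration();
          Cluster cluster = new Cluster(conf);
          Job job = cluster.getJob(JobID.forName(args[0]));
          Counters counters = job.getCounters();

          // "Bytes Read" as reported by the file input format
          long bytesRead =
                  counters.findCounter(FileInputFormatCounter.BYTES_READ).getValue();
          // HDFS_BYTES_READ lives in the file-system counter group for the HDFS scheme
          long hdfsBytesRead =
                  counters.findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();

          System.out.println("Bytes Read      = " + bytesRead);
          System.out.println("HDFS_BYTES_READ = " + hdfsBytesRead);
      }
  }

If these values are close to the total size of the input data set while the map tasks show low CPU utilization, the job is a good candidate for the compression approach described next.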

Solution 1: Compress Input Data

Compressing the input files saves storage space on HDFS and also speeds up the transfer of data to the map tasks.

We can use any of the compression formats below on the input data sets; a sketch of compressing a file with one of these codecs follows the table.


Format    Codec                                         Extension   Splittable         Native library
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec    .deflate    N                  Y
Gzip      org.apache.hadoop.io.compress.GzipCodec       .gz         N                  Y
Bzip2     org.apache.hadoop.io.compress.BZip2Codec      .bz2        Y                  Y
LZO       com.hadoop.compression.lzo.LzopCodec          .lzo        N (Y if indexed)   Y
LZ4       org.apache.hadoop.io.compress.Lz4Codec        .lz4        N                  Y
Snappy    org.apache.hadoop.io.compress.SnappyCodec     .snappy     N                  Y
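
As a concrete illustration, the sketch below compresses an existing HDFS file with one of the codecs from the table, naming the output with the codec's default extension so that MapReduce can later detect it automatically. The class name CompressInput and the command-line arguments (codec class name, then input path) are assumptions for the example, not a standard tool, and depending on the Hadoop version some codecs (for example LZO or Snappy) may also require the corresponding native library on the client.

  import java.io.InputStream;
  import java.io.OutputStream;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class CompressInput {
      public static void main(String[] args) throws Exception {
          // args[0]: codec class name from the table, args[1]: uncompressed HDFS path
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          Class<? extends CompressionCodec> codecClass =
                  Class.forName(args[0]).asSubclass(CompressionCodec.class);
          CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);

          Path in = new Path(args[1]);
          // Keep the codec's default extension (e.g. ".bz2") on the output file
          Path out = new Path(args[1] + codec.getDefaultExtension());

          try (InputStream is = fs.open(in);
               OutputStream os = codec.createOutputStream(fs.create(out))) {
              IOUtils.copyBytes(is, os, conf);
          }
      }
  }

For example, passing org.apache.hadoop.io.compress.BZip2Codec and /data/logs/part-0000 (a hypothetical path) would produce /data/logs/part-0000.bz2. Note that bzip2 is the only format in the table whose compressed output is itself splittable, which matters when individual input files are larger than one HDFS block.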

When we submit a MapReduce job against compressed data in HDFS, Hadoop determines whether an input file is compressed by inspecting its file name extension. If the extension matches a registered codec, Hadoop decompresses the file automatically with that codec, so we do not need to specify a codec explicitly in the MapReduce job.

However, if the file name extension does not match any codec, Hadoop will not recognize the format and will not decompress the file automatically. To keep this self-detection working, make sure each input file carries the default extension of the codec it was compressed with (see the Extension column in the table above).
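
One way to verify this before launching a job is to ask CompressionCodecFactory which codec, if any, matches a given path; this is the same extension-based lookup that the text input format's record reader performs. The sketch below assumes the path is passed as the first argument; the class name CheckExtension is illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;

  public class CheckExtension {
      public static void main(String[] args) {
          // args[0]: an input path, e.g. /data/logs/part-0000.gz (hypothetical)
          Configuration conf = new Configuration();
          CompressionCodecFactory factory = new CompressionCodecFactory(conf);

          // Match a codec against the file name extension
          CompressionCodec codec = factory.getCodec(new Path(args[0]));
          if (codec == null) {
              System.out.println("No codec matched; the file would be read as uncompressed data.");
          } else {
              System.out.println("Would be decompressed with " + codec.getClass().getName());
          }
      }
  }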