MapReduce Interview Questions Part 3

Q21 What does the JobConf class do?
Answer: MapReduce needs to logically separate the different jobs running on the same cluster. The ‘JobConf’ class holds job-level settings, such as the job name, that declare and configure a job in a real environment. It is recommended that the job name be descriptive and represent the type of job that is being executed.
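For example, setting a descriptive job name might look like this (a minimal sketch using the old org.apache.hadoop.mapred API that JobConf belongs to; the class and job names are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    public class JobNameDemo {
        public static void main(String[] args) {
            // JobConf carries the job-level settings for a MapReduce job.
            JobConf conf = new JobConf(JobNameDemo.class);
            // A descriptive name makes the job easy to find in the cluster UI.
            conf.setJobName("daily-log-word-count");
            System.out.println("Configured job name: " + conf.getJobName());
        }
    }

In the newer org.apache.hadoop.mapreduce API, the same setting lives on the Job class (job.setJobName(...)).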
Q22 Is it important for Hadoop MapReduce jobs to be written in Java?
Answer: It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language, such as Ruby, Perl, Python, R, Awk, etc., through the Hadoop Streaming API.
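A streaming job is launched from the command line, with the mapper and reducer supplied as executable scripts (a sketch; the jar path and the script names mapper.py and reducer.py are placeholders):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/data/input \
        -output /user/data/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py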
Q23 What is a Combiner?
Answer: A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mappers on a particular node and sends its output on to the reducer. Combiners enhance the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.
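A summing reducer is the classic combiner, because partial sums computed per node can safely be summed again at the reducer (a minimal sketch; the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Pre-sums the counts emitted by the mappers on the local node, so only
    // one (word, partialSum) pair per word leaves the node.
    public class LocalSumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable partialSum = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            partialSum.set(sum);
            context.write(word, partialSum);
        }
    }

It is enabled in the driver with job.setCombinerClass(LocalSumCombiner.class). Because the framework may run a combiner zero, one, or several times, only associative and commutative operations such as sums and maxima are safe to combine.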
Q24 What do sorting and shuffling do?
Answer: Sorting and shuffling are responsible for producing, for each unique key, the list of all values emitted for that key. Bringing identical keys together in one location is known as sorting, and the process by which the intermediate output of the mappers is transferred across to the reducers is known as shuffling.
Q25 What are the four basic parameters of a reducer?
Answer: The four basic parameters of a reducer are Text, IntWritable, Text, and IntWritable. The first two represent the intermediate output parameters (the reducer's input) and the second two represent the final output parameters.
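In driver code, the same four types appear as the intermediate and final output declarations (a sketch for a word-count style job; the class name is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerTypesDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Intermediate output (mapper -> reducer), i.e. the reducer's input:
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            // Final output written by the reducer:
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Together these match a reducer declared as
            // Reducer<Text, IntWritable, Text, IntWritable>.
        }
    }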
Q26 What are the key differences between Pig and MapReduce?
Answer: Pig is a data flow language; its key focus is managing the flow of data from the input source to the output store. As part of managing this data flow, it moves data along, feeding it to process 1, then taking that output and feeding it to process 2. Its core features are preventing the execution of subsequent stages if a previous stage fails, managing temporary storage of data, and, most importantly, compressing and rearranging processing steps for faster processing. While this could be done for any kind of processing task, Pig was written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are MapReduce jobs or data movement jobs. Pig also allows custom functions to be added for use in processing; some default ones are ordering, grouping, distinct, count, etc.
MapReduce, on the other hand, is a data processing paradigm: it is a framework for application developers to write code in so that it scales easily to petabytes of data. This creates a separation between the developer who writes the application and the developer who scales it. Not all applications can be migrated to MapReduce, but a good few can, ranging from complex ones like k-means clustering to simple ones like counting the unique values in a dataset.
Q27 Why can we not do aggregation or addition in a mapper? Why do we need a reducer for that?
Answer: We cannot do aggregation or addition in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. Each mapper is initialized for a single input split and its map method is invoked once per record, so while aggregating we would lose the values seen for the same key in earlier records and in other splits; no mapper has a track of the previous rows' values across the whole dataset. Only after shuffle and sort does a reducer receive all the values for a key together, which is what aggregation requires.
Q28 What does a split do?
Answer: Before the data is transferred from its hard disk location to the map method, it passes through a phase called the split. The split pulls a block of data from HDFS into the framework. The split does not write anything; it reads data from the block and passes it to the mapper. By default, splitting is taken care of by the framework, and the split size is equal to the block size, so the input is divided into a bunch of splits, one per block.
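The split size can also be tuned in the driver (a sketch; the 32 MB and 64 MB bounds are arbitrary example values):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Never build a split larger than 64 MB, even if the block is bigger.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
            // Never build a split smaller than 32 MB within a file.
            FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
        }
    }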
Q29 What does the text input format do?
Answer: In text input format, each line of the file becomes one record. The key is the line's byte offset from the start of the file and the value is the entire text of the line. This is how the data gets processed by a mapper: the mapper receives the ‘key’ as a ‘LongWritable’ parameter and the value as a ‘Text’ parameter.
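A mapper consuming text input therefore declares exactly those two input types (a word-count style sketch; the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // offset = byte position of this line in the file; line = its text.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }
    }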
Q30 What does a MapReduce partitioner do?
Answer: A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, which also allows the map output to be distributed evenly over the reducers. It redirects each mapper output record to a reducer by determining which reducer is responsible for that particular key.
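A minimal hash-based partitioner, in the style of Hadoop's default HashPartitioner, looks like this (a sketch; the class name is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Identical keys always hash to the same partition number, so all
            // of a key's values reach the same reducer; masking the sign bit
            // keeps the result non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

It is registered in the driver with job.setPartitionerClass(KeyHashPartitioner.class).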