MapReduce Interview Questions Part 1

Q1 What is MapReduce?
Answer: MapReduce is a parallel programming model used to process large data sets across hundreds or thousands of servers in a Hadoop cluster. MapReduce brings the computation to the location of the data, in contrast to traditional parallelism, which brings the data to the compute location. The term MapReduce is composed of the Map and Reduce phases. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the order of the name MapReduce implies, the reduce job is always performed after the map job. MapReduce programs are typically written in Java. All data emitted in the flow of a MapReduce program is in the form of key/value pairs.
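For illustration, here is the classic word count expressed as plain Java (a conceptual sketch, independent of the Hadoop API; the class and method names are illustrative only):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class WordCountConcept {
        // "Map": break one input line into (word, 1) pairs.
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
            return pairs;
        }

        // "Reduce": combine all values collected for one key into a single pair.
        static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return Map.entry(key, sum);
        }

        public static void main(String[] args) {
            // map phase emits: ("the",1), ("quick",1), ("the",1)
            System.out.println(map("the quick the"));
            // after grouping by key, reduce emits: ("the",2)
            System.out.println(reduce("the", List.of(1, 1)));
        }
    }

In a real cluster the framework performs the grouping (the shuffle and sort) between the two phases, so each reduce call sees one key together with all the values emitted for it.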

Q2 Explain a MapReduce program.
Answer: A MapReduce program consists of three parts: the Driver, the Mapper, and the Reducer.
The Driver code runs on the client machine and is responsible for building the configuration of the job and submitting it to the Hadoop cluster. The Driver code contains the main() method, which accepts arguments from the command line.
The Mapper code reads the input files as <Key,Value> pairs and emits key/value pairs. In the classic (org.apache.hadoop.mapred) API, the Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generic type parameters, which define the types of the input and output key/value pairs: the first two define the input key and value types, and the second two define the output key and value types.
The Reducer code reads the outputs generated by the different mappers as <Key,Value> pairs and emits key/value pairs. In the same API, the Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface likewise expects four generic type parameters: the first two define the intermediate key and value types, and the second two define the final output key and value types. A minimal sketch of both classes follows.
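To make this concrete, here is a minimal word-count Mapper and Reducer written against the classic (org.apache.hadoop.mapred) API described above; the class names WordCountMapper and WordCountReducer are illustrative, and each class would normally live in its own source file:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Input types <LongWritable, Text>; intermediate output types <Text, IntWritable>.
    class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit (word, 1) for every token on the input line.
            for (String token : line.toString().split("\\s+")) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
    }

    // Intermediate types <Text, IntWritable>; final output types <Text, IntWritable>.
    class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Sum all the 1s emitted for this word.
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            output.collect(word, new IntWritable(sum));
        }
    }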
Q3 What are the main configuration parameters that the user needs to specify to run a MapReduce job?
Answer: The user of the MapReduce framework needs to specify the following (each is typically set in the driver, as sketched after this list):
  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes
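A minimal driver sketch (classic API) that supplies every item on this list, reusing the hypothetical WordCountMapper and WordCountReducer from Q2:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // The JobConf records which JAR holds the mapper, reducer and driver.
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Input and output locations in the distributed file system.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Input and output formats.
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Classes containing the map and reduce functions.
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // Types of the final output key/value pairs.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            JobClient.runJob(conf);
        }
    }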


Q4 What does the Mapper do?
Answer: The Mapper is the first phase of a MapReduce job and carries out the map tasks. A Mapper reads key/value pairs and emits key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records, and a given input pair may map to zero or many output pairs, as the sketch below illustrates.
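As an example of the zero-or-many behaviour, here is a hypothetical filtering mapper (classic API) that drops short tokens, so one input record can produce no output pairs at all, or several:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    class LongWordMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            for (String token : line.toString().split("\\s+")) {
                // Filter: a line of short words yields zero output pairs;
                // a line of long words yields one pair per word.
                if (token.length() > 3) {
                    word.set(token);
                    out.collect(word, ONE);
                }
            }
        }
    }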
Q5 Is there an easy way to see the status and health of a cluster?
Answer: There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) that display status pages about the state of the entire system (by default on ports 50030 and 50070, respectively, in Hadoop 1.x). The JobTracker status page shows the state of all nodes, the job queue, and the status of all currently running jobs and tasks. The NameNode status page shows the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
Q6 Which interfaces need to be implemented to create a Mapper and a Reducer for Hadoop?
Answer:
  • org.apache.hadoop.mapreduce.Mapper
  • org.apache.hadoop.mapreduce.Reducer
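Note that in the newer org.apache.hadoop.mapreduce API these are actually abstract classes that you extend and whose map()/reduce() methods you override, while the older org.apache.hadoop.mapred API (used in Q2) defines Mapper and Reducer as interfaces. A minimal new-API sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NewApiWordCount {
        // Extend Mapper and override map(); emit via the Context object.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }

        // Extend Reducer and override reduce(); values arrive as an Iterable.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }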
Q7 Explain what SequenceFileInputFormat is.
Answer: SequenceFileInputFormat is an input format for reading sequence files: a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
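A sketch of how two chained jobs might be wired together with sequence files (classic API; the path /tmp/intermediate and the method configure() are hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class ChainedJobsSketch {
        static void configure(JobConf first, JobConf second) {
            // Job 1 writes its output as a compressed binary sequence file...
            first.setOutputFormat(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(first, new Path("/tmp/intermediate"));

            // ...and job 2 reads the same location back as its input,
            // avoiding a round trip through plain text.
            second.setInputFormat(SequenceFileInputFormat.class);
            FileInputFormat.setInputPaths(second, new Path("/tmp/intermediate"));
        }
    }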
Q8 What are ‘maps’ and ‘reduces’?
Answer: ‘Maps’ and ‘Reduces’ are the two phases of a MapReduce job, which processes data stored in HDFS. The ‘Map’ is responsible for reading data from the input location and, based on the input type, generating key/value pairs, that is, intermediate output, on the local machine. The ‘Reduce’ is responsible for processing the intermediate output received from the mappers and generating the final output.
Q9 What does conf.setMapperClass() do?
Answer: conf.setMapperClass() (a method on the classic API’s JobConf) sets the Mapper class for the job; all the map-side work, such as reading the input data and generating intermediate key/value pairs, is then performed by that class.
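A short sketch of the map-side settings in a driver (classic API; WordCountMapper is the hypothetical class from Q2):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class MapperConfigSketch {
        static void configureMapSide(JobConf conf) {
            // Tell the framework which Mapper implementation to instantiate
            // for every map task.
            conf.setMapperClass(WordCountMapper.class);

            // Declare the types of the intermediate key/value pairs the mapper
            // emits, so the framework can serialize and sort them correctly.
            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(IntWritable.class);
        }
    }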
Q10 What are the methods in the Reducer class, and in what order are they invoked?
Answer: The Reducer class contains a run() method, which calls its setup() method once, then calls the reduce() method once for each distinct input key, and finally calls its cleanup() method once.
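In outline, the new-API Reducer’s run() method behaves like the following simplified sketch (the real Hadoop implementation also guards cleanup() with a try/finally so it runs even if reduce() throws):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RunOrderReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);                  // called once, before the first key
            while (context.nextKey()) {      // once per distinct input key
                reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);                // called once, after the last key
        }
    }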