MapReduce | Hadoop Developer Self Learning Outline

Continuing my earlier post on the Hadoop Developer Self Learning Outline,
I am going to write a short and simple tutorial on MapReduce.
In this post I am going to cover the following topics.
  1. MapReduce concepts
  2. Daemons: JobTracker / TaskTracker
  3. Phases: driver, mapper, shuffle/sort, and reducer
  4. First MapReduce job

1. MapReduce Concepts:

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. It is inspired by the map and reduce operations commonly used in functional programming languages like Lisp.

  • MapReduce is a programming model designed for processing large volumes of distributed data.
  • A task is distributed across multiple nodes, and all nodes process their own data in parallel.
  • Two phases process the data: Map and Reduce.
                Map(): processes a key/value pair to generate intermediate key/value pairs
                map(k, v) → list(k1, v1)
                Reduce(): merges all intermediate values associated with the same key
                reduce(k1, list(v1)) → v2
  • Input: a set of key/value pairs
  • MapReduce performs parallel computing because of its ‘Shared Nothing’ architecture, i.e. the tasks have no dependency on each other.
  • MR jobs can be written in Java, C++, Python, etc.

Pseudo-code:

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The grouping step is done by the framework on the keys of the
// intermediate pairs emitted above; reduce is then called once per
// key with the list of values in that group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(output_key, AsString(result));
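
The same word count in Hadoop's Java MapReduce API (a minimal sketch based on the classic WordCount example; in practice each public class goes in its own source file):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // EmitIntermediate(w, "1")
        }
    }
}

// Reducer: sums the counts emitted for each word.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();              // result += ParseInt(v)
        }
        result.set(sum);
        context.write(key, result);        // Emit(word, count)
    }
}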

Features:
  • Automatic parallelization and distribution 
  • Fault tolerance 
  • A clean abstraction for programmers
  • All of Hadoop is written in Java 
  • MapReduce abstracts all the ‘housekeeping’ away from the developer 
  • Developers can simply concentrate on writing the Map and Reduce functions
MR Flow:

[Figure: MapReduce Flow]

[Figure: MapReduce Execution Flow]

(Source: Hadoop: The Definitive Guide. Heavy arrows show data transfer between nodes; dotted arrows show data transfer within a node.)

2. Daemons: 

MRv1 daemons

Job Tracker – one per cluster
  • The JobTracker process runs on a separate node, usually not on a DataNode.
  • JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by ResourceManager/ApplicationMaster in MRv2.
  • JobTracker receives the requests for MapReduce execution from the client.
  • JobTracker talks to the NameNode to determine the location of the data.
  • JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality (proximity of the data) and the available slots to execute a task on a given node.
  • JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
  • JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
  • When the JobTracker is down, HDFS remains functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.

Task Tracker – one per slave node
  • TaskTracker runs on DataNodes, usually on every DataNode.
  • TaskTracker is replaced by Node Manager in MRv2.
  • Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
  • The JobTracker assigns Mapper and Reducer tasks to TaskTrackers for execution.
  • TaskTracker will be in constant communication with the JobTracker signalling the progress of the task in execution.
  • TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task executed by the TaskTracker to another node.

MRv2 daemons

Resource Manager – one per cluster 
  • Starts Application Masters, allocates resources on slave nodes 
Application Master – one per job 
  • Requests resources, manages individual Map and Reduce tasks 
Node Manager – one per slave node 
  • Manages resources on individual slave nodes 
Job History Server – one per cluster 
  •  Archives jobs’ metrics and metadata
Note: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not recommended; it will degrade your performance and may result in an unstable MapReduce cluster deployment.

To start and stop the MapReduce daemons (MRv1)

  • To start the MapReduce daemons manually, enter the following command:

                                # $HADOOP_HOME/bin/start-mapred.sh

  • To stop the MapReduce daemons manually, enter the following command:

                                # $HADOOP_HOME/bin/stop-mapred.sh


3. Phases: Driver, Mapper, Shuffle/Sort, and Reducer

The driver configures the job (input and output paths, the mapper and reducer classes, and the key/value types) and submits it to the cluster; a complete driver is shown in section 4.

The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.

The combine phase is done by combiners. The combiner is an optional, map-side reduce step that merges key/value pairs with the same key before the shuffle. The framework may run a combiner zero, one, or multiple times, so it must not change the result; in the Java API the reducer class is often registered as the combiner, as shown in the driver in section 4.

The shuffle and sort phase is done by the framework. Data from all mappers is grouped by key, split among reducers, and sorted by key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting (see the sketch below) and a partitioner for the data split.
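
For example, a custom sort comparator that reverses the key order might look like this (a minimal sketch; the class name is illustrative, and it would be registered with job.setSortComparatorClass(DescendingTextComparator.class)):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: sorts the Text keys in descending order
// during the shuffle/sort phase.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);  // true: instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);  // reverse the natural (ascending) order
    }
}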

The partitioner decides which reducer will get a particular key/value pair, as sketched below.
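
A minimal custom partitioner sketch for the word-count pairs above (the class name and first-letter scheme are hypothetical; the default is HashPartitioner, which uses hash(key) mod numPartitions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: buckets words by their first letter so
// that words sharing an initial letter land on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty()) {
            return 0;
        }
        // Non-negative bucket index, wrapped to the reducer count.
        int bucket = Character.toLowerCase(word.charAt(0)) - 'a';
        return Math.abs(bucket) % numPartitions;
    }
}

It is registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).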

The reducer obtains key/[value list] pairs, sorted by key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input key/value pair.

4. First MapReduce job
Run your MapReduce code in Eclipse (WordCount in Eclipse);
detailed steps with screenshots are covered in that post. A minimal driver is sketched below.
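
A driver sketch tying the pieces together (assumes the TokenizerMapper and IntSumReducer classes from section 1; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner: word counting is
        // commutative and associative, so map-side partial sums are safe.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package the classes into a jar and run it with, for example, hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory must not already exist.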

A Hadoop blog: Blog Link | FB page: Hadoop Quiz
Comment for updates or changes.
References:
Hadoop: The Definitive Guide, 4th Edition; Cloudera documentation; Hadoop blogs.