MapReduce | Hadoop Developer Self Learning Outline

Continuing my earlier post on the Hadoop Developer Self Learning Outline,
I am going to write a short and simple tutorial on MapReduce.
In this post I am going to cover the following topics.
  1. MapReduce concepts
  2. Daemons: JobTracker / TaskTracker
  3. Phases: driver, mapper, shuffle/sort, and reducer
  4. First MapReduce job

1. MapReduce Concepts:

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. It is inspired by the map and reduce operations commonly used in functional programming languages like Lisp.

  • MapReduce is a programming model designed for processing large volumes of distributed data.
  • A task is distributed across multiple nodes, and all nodes process their own data in parallel.
  • Two phases process the data: Map and Reduce.
                Map(): processes a key/value pair to generate intermediate key/value pairs
                map(k, v) → list(k1, v1)
                Reduce(): merges all intermediate values associated with the same key
                reduce(k1, list(v1)) → v2
  • Input: a set of key/value pairs
  • MapReduce performs parallel computing because of its ‘Shared Nothing’ architecture, i.e. the tasks have no dependency on each other.
  • MR jobs can be written in Java, C++, Python, etc.

Pseudo-code:

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

// The grouping step is done by the framework on the keys of the
// intermediate pairs emitted above; reduce is then called once per
// key with the list of values in that group.

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(output_key, AsString(result));
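
The same word count in Hadoop's Java MapReduce API (a minimal sketch based on the classic WordCount example; in practice each public class goes in its own source file):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // EmitIntermediate(w, "1")
        }
    }
}

// Reducer: sums the counts emitted for each word.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();              // result += ParseInt(v)
        }
        result.set(sum);
        context.write(key, result);        // Emit(word, count)
    }
}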

Features:
  • Automatic parallelization and distribution 
  • Fault tolerance 
  • A clean abstraction for programmers
  • All of Hadoop is written in Java 
  • MapReduce abstracts all the ‘housekeeping’ away from the developer 
  • Developers can simply concentrate on writing the Map and Reduce functions
MR Flow:

[Figure: MapReduce Flow]

[Figure: MapReduce Execution Flow]

(Source: Hadoop: The Definitive Guide. Heavy arrows show data transfer between nodes; dotted arrows show data transfer within a node.)

2. Daemons: 

MRv1 daemons

Job Tracker – one per cluster
  • The JobTracker process runs on a separate node, usually not on a DataNode.
  • JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by ResourceManager/ApplicationMaster in MRv2.
  • JobTracker receives the requests for MapReduce execution from the client.
  • JobTracker talks to the NameNode to determine the location of the data.
  • JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality (proximity of the data) and the available slots to execute a task on a given node.
  • JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
  • JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
  • When the JobTracker is down, HDFS remains functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.

Task Tracker – one per slave node
  • TaskTracker runs on DataNodes, usually on every DataNode.
  • TaskTracker is replaced by Node Manager in MRv2.
  • Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
  • The JobTracker assigns Mapper and Reducer tasks to TaskTrackers for execution.
  • TaskTracker will be in constant communication with the JobTracker signalling the progress of the task in execution.
  • TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, JobTracker will assign the task executed by the TaskTracker to another node.

MRv2 daemons

Resource Manager – one per cluster 
  • Starts Application Masters, allocates resources on slave nodes 
Application Master – one per job 
  • Requests resources, manages individual Map and Reduce tasks 
Node Manager – one per slave node 
  • Manages resources on individual slave nodes 
Job History Server – one per cluster 
  •  Archives jobs’ metrics and metadata
Note: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not recommended; it will degrade your performance and may result in an unstable MapReduce cluster deployment.

To start and stop the MapReduce daemons (MRv1)

  • To start the MapReduce daemons manually, enter the following command:

                                # $HADOOP_HOME/bin/start-mapred.sh

  • To stop the MapReduce daemons manually, enter the following command:

                                # $HADOOP_HOME/bin/stop-mapred.sh


3. Phases: Driver, Mapper, Shuffle/Sort, and Reducer

The driver configures the job (input and output paths, the mapper and reducer classes, and the key/value types) and submits it to the cluster; a complete driver is shown in section 4.

The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.

The combine phase is done by combiners. The combiner is an optional, map-side reduce step that merges key/value pairs with the same key before the shuffle. The framework may run a combiner zero, one, or multiple times, so it must not change the result; in the Java API the reducer class is often registered as the combiner, as shown in the driver in section 4.

The shuffle and sort phase is done by the framework. Data from all mappers is grouped by key, split among reducers, and sorted by key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting (see the sketch below) and a partitioner for the data split.
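
For example, a custom sort comparator that reverses the key order might look like this (a minimal sketch; the class name is illustrative, and it would be registered with job.setSortComparatorClass(DescendingTextComparator.class)):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: sorts the Text keys in descending order
// during the shuffle/sort phase.
public class DescendingTextComparator extends WritableComparator {

    protected DescendingTextComparator() {
        super(Text.class, true);  // true: instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);  // reverse the natural (ascending) order
    }
}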

The partitioner decides which reducer will get a particular key/value pair, as sketched below.
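
A minimal custom partitioner sketch for the word-count pairs above (the class name and first-letter scheme are hypothetical; the default is HashPartitioner, which uses hash(key) mod numPartitions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: buckets words by their first letter so
// that words sharing an initial letter land on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty()) {
            return 0;
        }
        // Non-negative bucket index, wrapped to the reducer count.
        int bucket = Character.toLowerCase(word.charAt(0)) - 'a';
        return Math.abs(bucket) % numPartitions;
    }
}

It is registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).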

The reducer obtains key/[value list] pairs, sorted by key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input key/value pair.

4. First MapReduce job
Run your MapReduce code in Eclipse (WordCount in Eclipse);
detailed steps with screenshots are covered in that post. A minimal driver is sketched below.
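
A driver sketch tying the pieces together (assumes the TokenizerMapper and IntSumReducer classes from section 1; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner: word counting is
        // commutative and associative, so map-side partial sums are safe.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package the classes into a jar and run it with, for example, hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory must not already exist.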

A Hadoop blog: Blog Link | FB page: Hadoop Quiz
Comment for updates or changes.
References:
Hadoop: The Definitive Guide, 4th Edition; Cloudera documentation; Hadoop blogs.