Hadoop Introduction 2 | Hadoop Developer Self Learning


Following up on my earlier post, the Hadoop Developer Self Learning Outline, I am writing a short and simple tutorial series on it.
In this post I am going to cover the following topics:

Prerequisite: Understanding Big Data

Hadoop Introduction
  • Hadoop history and concepts
  • Ecosystem
The two topics above are covered in part one of Hadoop Introduction. In this post we look at the Hadoop distributions, the factors to consider when choosing one, and the Hadoop high-level architecture:
  • Distributions
  • High level architecture
Distributions
Hadoop is a top-level Apache project.
Different vendors have worked on Hadoop and developed their own distributions. Choose a distribution carefully.
You can refer to the considerations below when selecting a vendor.


The top Hadoop distributions are listed here. Some are free, some are premium, and some offer both free and premium editions (e.g. Cloudera).
  • Amazon Elastic MapReduce 
  • Cloudera CDH Hadoop Distribution 
  • Hortonworks Data Platform (HDP) 
  • MapR Hadoop Distribution 
  • IBM Open Platform 
  • Microsoft Azure's HDInsight - Cloud-based Hadoop Distribution
  • Pivotal Big Data Suite 
  • Datameer Professional 
  • Datastax Enterprise Analytics 
  • Dell - Cloudera Apache Hadoop Solution
A few popular vendors and their recent releases:

Vendor               Product evaluated                               Product version evaluated
Cloudera             Cloudera Enterprise                             5.5
Hortonworks          Hortonworks Data Platform                       2.3
IBM                  IBM BigInsights for Apache Hadoop               4.1
MapR Technologies    The MapR Distribution including Apache Hadoop   5
Pivotal Software     Pivotal HD                                      3.x

High level Architecture
The Hadoop 1.0 architecture is shown below.
[Figure: high level architecture]

Core Components

  • HDFS (Hadoop Distributed File System)
         -  Distributed Storage
  • MR framework (MapReduce)
         -  Parallel Processing/Computing


[Figure: architecture components]

Hadoop 1.0 and 2.0:

YARN was introduced in Hadoop 2.0.

[Figure: Hadoop 1.0 and 2.0 architecture]

Description of the components

Apache HDFS
The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Together with ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
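To make "storing large files across multiple machines" more concrete, here is a minimal Python sketch of how a file could be cut into blocks and each block copied to several DataNodes. It is only an illustration, not how HDFS is implemented: the function name `place_blocks` and the node names are made up for this example, the 128 MB block size and replication factor of 3 are assumed defaults, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
from math import ceil

BLOCK_SIZE_MB = 128  # assumed block size (the usual Hadoop 2.x default)
REPLICATION = 3      # assumed replication factor (the usual HDFS default)


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} for a file of the given size.

    Hypothetical helper: replicas are assigned round-robin, which is a
    simplification of the real rack-aware placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # made-up DataNode names
    for block, replicas in place_blocks(300, nodes).items():
        print(block, replicas)
```

In this sketch a 300 MB file becomes three blocks, and every block lives on three different nodes, so losing one node does not lose any data.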
Apache MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google's paper “MapReduce: Simplified Data Processing on Large Clusters”. The current Apache MapReduce version is built on top of the Apache YARN framework. YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that makes it easier to write arbitrary distributed processing frameworks and applications. YARN’s execution model is more generic than the earlier MapReduce implementation: unlike the original Apache Hadoop MapReduce (also called MR1), YARN can run applications that do not follow the MapReduce model. Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.
(Source: GitHub)
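The MapReduce programming model described above can be sketched in a few lines of plain Python. This is a single-machine toy and does not use the Hadoop API: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are my own, chosen only to mirror the three stages of a classic word count job.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework handles the shuffle over the network. Here everything runs in one process just to show the data flow.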

A Hadoop blog: Blog Link
FB page: Hadoop Quiz
Leave a comment for updates or changes.