HDFS | Hadoop Developer Self Learning Outline


This post continues my earlier Hadoop Developer Self Learning Outline. It is a short and simple tutorial on HDFS.
In this post I am going to cover the following topics:

  1. Concepts (Distributed storage, horizontal scaling, replication, rack awareness)
  2. Architecture
  3. Namenode (function, storage, file system meta-data, and block reports)
  4. Secondary namenode
  5. Data node
  6. Configuration files
  7. Single node and multi node installation
  8. Communications / heart-beats
  9. Block manager / balancer
  10. Health check / safemode
Concepts (Distributed storage, horizontal scaling, replication, rack awareness)
1.       Hadoop Distributed File System (HDFS)

Basic Concept:
·         HDFS is a filesystem written in Java, based on Google’s GFS
·         Sits on top of a native file system such as ext3, ext4 or xfs
·         Provides redundant storage for massive amounts of data, using readily available, industry-standard computers
·         HDFS performs best with a ‘modest’ number of large files
o   Millions, rather than billions, of files
o   Each file typically 100MB or more
·         Files in HDFS are ‘write once’
·         No random writes to files are allowed
·         HDFS is optimized for large, streaming reads of files
o   Rather than random reads
Rack awareness and HA will be covered in another post.
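The "modest number of large files" guidance comes down to NameNode memory: every file and every block is a metadata object held in RAM. A rough sketch in Python (the block size default and the ~150 bytes of heap per object are commonly cited approximations, not exact figures):

```python
# Sketch: why HDFS prefers a modest number of large files.
# BLOCK_SIZE is the modern dfs.blocksize default; BYTES_PER_OBJECT is
# a commonly cited rough estimate of NameNode heap per metadata object.

BLOCK_SIZE = 128 * 1024 * 1024      # 128 MiB
BYTES_PER_OBJECT = 150              # approximate heap cost per file/block

def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """A file is split into fixed-size blocks; the last one may be partial."""
    return max(1, -(-file_size // block_size))   # ceiling division

def namenode_heap(files: int, avg_file_size: int) -> int:
    """Approximate NameNode memory: one object per file plus one per block."""
    blocks = files * num_blocks(avg_file_size)
    return (files + blocks) * BYTES_PER_OBJECT

one_gib = 1024 ** 3
# Same total data either way, but far more metadata for the small files:
print(namenode_heap(1_000, one_gib))        # 1,000 x 1 GiB  -> ~1.35 MB heap
print(namenode_heap(1_000_000, 1024**2))    # 1,000,000 x 1 MiB -> ~300 MB heap
```

This is why millions of large files are fine, while billions of tiny files overwhelm the NameNode long before the disks fill up.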

2.       Architecture
Fig. HDFS Architecture

3.       Functions of a Name Node
          Manages File System Namespace
o   Maps a file name to a set of blocks
o   Maps a block to the DataNodes where it resides
          Cluster Configuration Management
          Replication Engine for Blocks
   
  NameNode Metadata
          Metadata in Memory
o   The entire metadata is in main memory
o   No demand paging of metadata
          Types of metadata
o   List of files
o   List of Blocks for each file
o   List of DataNodes for each block
o   File attributes, e.g. creation time, replication factor
          A Transaction Log
o    Records file creations, file deletions, etc.
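The metadata types above can be sketched as a few in-memory maps. This is a hypothetical, heavily simplified model of the NameNode's bookkeeping, not the real data structures:

```python
# Simplified model of NameNode metadata: a file maps to an ordered list
# of block IDs, each block ID maps to the DataNodes holding a replica,
# and namespace changes are appended to a transaction log.

file_to_blocks: dict[str, list[int]] = {}     # filename -> block IDs
block_to_datanodes: dict[int, set[str]] = {}  # block ID -> DataNode hosts
edit_log: list[tuple] = []                    # transaction log

def create_file(name: str, block_ids: list[int]) -> None:
    file_to_blocks[name] = block_ids
    for b in block_ids:
        block_to_datanodes.setdefault(b, set())
    edit_log.append(("create", name))

def report_replica(block_id: int, datanode: str) -> None:
    """Applied when a DataNode's block report mentions this block."""
    block_to_datanodes[block_id].add(datanode)

create_file("/logs/app.log", [101, 102])
report_replica(101, "dn1")
report_replica(101, "dn2")

print(file_to_blocks["/logs/app.log"])   # [101, 102]
print(sorted(block_to_datanodes[101]))   # ['dn1', 'dn2']
```

Because all of this lives in main memory with no demand paging, lookups are fast, but total metadata size is bounded by the NameNode's heap.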

4.       Secondary NameNode
          Copies FsImage and Transaction Log from the NameNode to a temporary directory
          Merges FsImage and Transaction Log into a new FsImage in the temporary directory
          Uploads new FSImage to the NameNode
o   Transaction Log on NameNode is purged
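The checkpoint step can be sketched as replaying the transaction log against the last FsImage. The flat-set namespace and function names below are illustrative only; the real FsImage is a serialized directory tree:

```python
# Sketch of the Secondary NameNode checkpoint: apply each logged
# operation to a copy of the old FsImage to produce a new FsImage.

def checkpoint(fsimage: set[str], edit_log: list[tuple[str, str]]) -> set[str]:
    """Merge FsImage and transaction log into a new FsImage snapshot."""
    new_image = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            new_image.add(path)
        elif op == "delete":
            new_image.discard(path)
    return new_image

old_image = {"/a", "/b"}
log = [("create", "/c"), ("delete", "/a")]
new_image = checkpoint(old_image, log)
print(sorted(new_image))   # ['/b', '/c']
# Once the new FsImage is uploaded, the NameNode can purge this log.
```

The point of the exercise is to keep the transaction log short, so a NameNode restart does not have to replay a huge backlog of edits.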

5.       DataNode
          A Block Server
o   Stores data in the local file system (e.g. ext3)
o   Stores metadata of a block (e.g. CRC)
o   Serves data and metadata to Clients

6.       Single node and multi node installation

Single node setup can be found at

7.       Configuration files
The files that need to be configured during setup are:
1.       hadoop-env.sh
2.       core-site.xml
3.       hdfs-site.xml
4.       yarn-site.xml
5.       mapred-site.xml
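As a minimal illustration, core-site.xml mainly needs to point clients at the NameNode via fs.defaultFS. The host and port below are placeholders for a single-node setup; adjust them for your cluster:

```xml
<?xml version="1.0"?>
<!-- core-site.xml: minimal single-node sketch.
     "localhost:9000" is a placeholder NameNode address. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```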

8.       Communications / heart-beats
Heartbeats
          DataNodes send heartbeats to the NameNode
o   Once every 3 seconds
          NameNode uses heartbeats to detect DataNode failure
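The failure-detection rule can be sketched numerically. With the defaults (dfs.heartbeat.interval = 3 s, dfs.namenode.heartbeat.recheck-interval = 5 min), the dead-node cutoff works out to 2 × recheck-interval + 10 × heartbeat = 10 minutes 30 seconds; both intervals are configurable:

```python
# Sketch of how the NameNode decides a DataNode is dead, using the
# default Hadoop intervals (both are configurable properties).

HEARTBEAT_INTERVAL = 3            # dfs.heartbeat.interval (seconds)
RECHECK_INTERVAL = 5 * 60         # dfs.namenode.heartbeat.recheck-interval (s)
DEAD_CUTOFF = 2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL

def is_dead(last_heartbeat: float, now: float) -> bool:
    """True once a DataNode has been silent longer than the cutoff."""
    return now - last_heartbeat > DEAD_CUTOFF

print(DEAD_CUTOFF)                        # 630 seconds = 10m30s
print(is_dead(last_heartbeat=0, now=600)) # False -- still within the window
print(is_dead(last_heartbeat=0, now=631)) # True  -- declared dead
```

The generous cutoff avoids re-replicating blocks on every brief network hiccup, at the cost of reacting slowly to real failures.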

9.       Block manager / balancer
Block Placement
          Block Report
o   Periodically sends a report of all existing blocks to the NameNode
          Facilitates Pipelining of Data
o   Forwards data to other specified DataNodes
          Current Strategy
o   One replica on local node
o   Second replica on a remote rack
o   Third replica on same remote rack
o   Additional replicas are randomly placed
          Clients read from nearest replicas
          The placement policy is intended to be pluggable
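The three-replica strategy above can be sketched directly: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The rack map and node names below are illustrative:

```python
# Sketch of the default block placement strategy for replication factor 3.

def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    rack_of = {n: r for r, nodes in racks.items() for n in nodes}
    first = writer                                   # replica 1: local node
    remote_rack = next(r for r in racks if r != rack_of[writer])
    second = racks[remote_rack][0]                   # replica 2: remote rack
    third = next(n for n in racks[remote_rack] if n != second)
    return [first, second, third]                    # replica 3: same remote rack

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", racks)
print(replicas)   # ['dn1', 'dn3', 'dn4']
```

Two racks hold all three replicas, so the data survives the loss of either whole rack while only one copy crosses the inter-rack link during the write pipeline.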

Replication Engine
          NameNode detects DataNode failures
o   Chooses new DataNodes for new replicas
o   Balances disk usage
o   Balances communication traffic to DataNodes
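The core of that detection step is comparing each block's live replica count against the target replication factor. A minimal sketch, with illustrative data:

```python
# Sketch of the replication engine's check: after DataNode failures,
# find blocks with fewer live replicas than the replication factor,
# so new replicas can be scheduled on other nodes.

REPLICATION_FACTOR = 3

def under_replicated(block_replicas: dict[int, set[str]],
                     live_nodes: set[str]) -> dict[int, int]:
    """Map each under-replicated block ID to how many replicas it is missing."""
    missing = {}
    for block, nodes in block_replicas.items():
        live = len(nodes & live_nodes)
        if live < REPLICATION_FACTOR:
            missing[block] = REPLICATION_FACTOR - live
    return missing

block_map = {101: {"dn1", "dn2", "dn3"}, 102: {"dn1", "dn4", "dn5"}}
live = {"dn2", "dn3", "dn4", "dn5"}        # dn1 has failed
print(under_replicated(block_map, live))   # {101: 1, 102: 1}
```

When choosing targets for the new replicas, the NameNode also weighs disk usage and traffic, which is where the balancing bullets above come in.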

10.   Health check / safemode
          On startup the NameNode enters safemode, a read-only state
o   It leaves safemode automatically once a sufficient fraction of blocks have been reported by DataNodes
o   You can check or control it with hdfs dfsadmin -safemode get|enter|leave
          hdfs fsck / reports the overall health of the file system (missing, corrupt and under-replicated blocks)

A Hadoop Blog: Blog Link
FB Page: Hadoop Quiz
Comment for updates or changes.