Tuesday, June 27, 2017

HDFS2 | Hadoop Developer Self Learning Outline

This post follows up on my earlier post on the Hadoop Developer Self Learning Outline.
It is a short and simple tutorial on HDFS.
In this post I am going to cover the following topics.

  1. Read / write path
  2. Navigating HDFS UI
  3. Command-line interaction with HDFS
  4. File system abstractions
  5. Reading / writing files using Java API
  6. Latest in HDFS
  7. Namenode HA and Federation
1. Read Path

[1]   The client program starts with the Hadoop library jar and a copy of the cluster configuration data, which specifies the location of the name node.
[2]   The client begins by contacting the name node, indicating the file it wants to read.
[3]   The name node validates the client's identity, either by simply trusting the client or by using an authentication protocol such as Kerberos.
[4]   The client's identity is checked against the owner and permissions of the file.
[5]   The name node responds to the client with the first block ID and the list of data nodes on which a copy of the block can be found, sorted by their distance to the client. Distance to the client is measured according to Hadoop's rack topology.
[6]   With the block IDs and data node hostnames, the client can now contact the most appropriate data node directly and read the block data it needs. This process repeats until all the blocks in the file have been read or the client closes the file stream.



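// Needs org.apache.hadoop.fs.FileSystem, Path and FSDataInputStream; conf is an org.apache.hadoop.conf.Configuration
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/path/to/file.ext");
if (!fileSystem.exists(path)) {
    System.out.println("File does not exist");
    return;
}
// try-with-resources closes the stream even if a read fails
try (FSDataInputStream in = fileSystem.open(path)) {
    byte[] b = new byte[1024];
    int numBytes;
    while ((numBytes = in.read(b)) > 0) {
        // code to manipulate the data which is read
        System.out.print(new String(b, 0, numBytes));
    }
}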
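Step [5] of the read path says the name node sorts replica locations by their distance to the client. Here is a minimal sketch of that idea in plain Java, not Hadoop's actual implementation. It assumes each node is described by a topology path of the form /datacenter/rack/node (the names d1, r1, n1 are made up) and counts the hops from each node up to their closest common ancestor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopologyDistance {
    // Distance = hops from each node up to their closest common ancestor.
    // Both paths are assumed to start with "/", e.g. "/d1/r1/n1".
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    // Returns the replica locations ordered from nearest to farthest from the client.
    static List<String> sortByDistance(String client, List<String> replicas) {
        List<String> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt((String r) -> distance(client, r)));
        return sorted;
    }

    public static void main(String[] args) {
        String client = "/d1/r1/n1";
        List<String> replicas = Arrays.asList("/d1/r2/n4", "/d1/r1/n2", "/d1/r1/n1");
        // Prints [/d1/r1/n1, /d1/r1/n2, /d1/r2/n4]
        System.out.println(sortByDistance(client, replicas));
    }
}
```

With this measure a replica on the same node has distance 0, one on the same rack has distance 2, and one on a different rack in the same data center has distance 4, so the client reads from the closest copy first.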

Write Path

[1]   The client makes a request to open a file for writing using the Hadoop FileSystem APIs.
[2]   A request is sent to the name node to create the file metadata, if the user has the necessary permission to do so. However, the file initially has no associated blocks.
[3]   The name node responds to the client, indicating that the request was successful and that it should start writing data.
[4]   The client library asks the name node for the set of data nodes to which the data should be written, and gets a list back from the name node.
[5]   The client makes a connection to the first data node, which in turn makes a connection to the second, and the second data node makes a connection to the third.
[6]   The client starts writing data to the first data node. The first data node writes the data to disk as well as to the stream connected to the second data node. The second data node writes the data to disk and to the connection pointing to the third data node, and so on.
[7]   Once the client has finished writing, it closes the stream, which flushes the remaining data and writes it to disk.
Write HDFS


// Needs org.apache.hadoop.fs.FileSystem, Path and FSDataOutputStream, plus java.io.*
FileSystem fileSystem = FileSystem.get(conf);
// Check if the file already exists
Path path = new Path("/path/to/file.ext");
if (fileSystem.exists(path)) {
    System.out.println("File " + path + " already exists");
    return;
}
// Create a new file and write data to it.
// source is the path (a String) of the local file to copy into HDFS.
try (FSDataOutputStream out = fileSystem.create(path);
     InputStream in = new BufferedInputStream(new FileInputStream(
             new File(source)))) {
    byte[] b = new byte[1024];
    int numBytes = 0;
    while ((numBytes = in.read(b)) > 0) {
        out.write(b, 0, numBytes);
    }
}
// try-with-resources closes all the file descriptors
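Steps [5] and [6] of the write path describe a chain of data nodes, where each node stores the data and forwards it to the next. The sketch below models that chain in plain Java, with no Hadoop classes involved. The Node class and its in-memory "disk" are hypothetical stand-ins for a real data node.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class WritePipelineDemo {
    // Hypothetical stand-in for a data node: stores bytes locally, then forwards them.
    static class Node {
        final ByteArrayOutputStream disk = new ByteArrayOutputStream(); // simulated local disk
        private final Node next; // next node in the pipeline, or null for the last node

        Node(Node next) {
            this.next = next;
        }

        void write(byte[] buf, int off, int len) {
            disk.write(buf, off, len);          // write to this node's disk
            if (next != null) {
                next.write(buf, off, len);      // forward the same bytes downstream
            }
        }
    }

    public static void main(String[] args) {
        // Build the chain back to front: first -> second -> third.
        Node third = new Node(null);
        Node second = new Node(third);
        Node first = new Node(second);

        // The client only ever talks to the first node.
        byte[] data = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        first.write(data, 0, data.length);

        // Prints hello hdfs, because the third node received its own copy.
        System.out.println(new String(third.disk.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The point of the design is that the client sends each byte once, and the replication traffic is spread across the data nodes instead of coming from the client three times.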

2. Navigating HDFS UI
Apart from the CLI, Hadoop also provides UI access.

Web UIs for the Common User
The default Hadoop ports are as follows:

Default Port
Configuration Parameter
Backup/Checkpoint node?

3. Command-line interaction with HDFS

To start HDFS-CLI, run the following command:
                                  java -jar hdfs-cli-0.0.1-SNAPSHOT.jar

A detailed doc covering the basic HDFS commands will be added soon.

All commands can be executed using the CLI. Let's take the Hadoop NameNode commands as an example.

Hadoop Namenode Commands (older versions)
hadoop namenode -format: Format the HDFS filesystem from the Namenode
hadoop namenode -upgrade: Upgrade the NameNode
start-dfs.sh: Start HDFS Daemons
stop-dfs.sh: Stop HDFS Daemons
start-mapred.sh: Start MapReduce Daemons
stop-mapred.sh: Stop MapReduce Daemons
hadoop namenode -recover -force: Recover namenode metadata after a cluster failure (may lose data)

All other Hadoop commands can be found at Command Centre
Reference Material for Hadoop command line interface 

Areas Where HDFS Is Not a Good Fit Today
  • Low-latency data access 
  • Lots of small files 
  • Multiple writers, arbitrary file modifications

4. Abstracting the Block in HDFS

A block is the minimum amount of data that can be read or written. The default block size is 128 MB in Hadoop 2.x (it was 64 MB in Hadoop 1.x). Files in HDFS are broken into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus a large file made of multiple blocks is transferred at close to the disk transfer rate.
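To make the block arithmetic concrete, here is a small sketch in plain Java. The numbers are assumptions chosen for illustration only: a 128 MB block size, a 10 ms seek time and a 100 MB/s disk transfer rate. It shows how many blocks a file needs and what fraction of the read time goes to seeking.

```java
class BlockMath {
    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Fraction of the total time for reading one block that is spent seeking.
    static double seekOverhead(double seekMs, double transferMBps, double blockMB) {
        double transferMs = blockMB / transferMBps * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks needs 3 blocks (128 + 128 + 44 MB).
        System.out.println(blockCount(300 * mb, 128 * mb));
        // With a 128 MB block, seeking is under 1% of the read time.
        System.out.println(seekOverhead(10, 100, 128));
        // With a 4 KB disk-sized block, seeking is over 99% of the read time.
        System.out.println(seekOverhead(10, 100, 0.004));
    }
}
```

The last block of a file only holds the remaining data, so the 300 MB file in the example stores 44 MB in its third block rather than a full 128 MB.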
5. Benefits of Block Abstraction
A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
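The fallback to another copy of a block can be pictured with a short sketch in plain Java. This is only an illustration of the idea, not how the HDFS client is written. The Replica class and the host names dn1, dn2, dn3 are hypothetical, and a null data field stands for a lost or corrupted copy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class ReplicaFallback {
    // Hypothetical replica: a host name plus the block bytes, or null if this copy is lost.
    static class Replica {
        final String host;
        final byte[] data;

        Replica(String host, byte[] data) {
            this.host = host;
            this.data = data;
        }
    }

    // Tries each replica in order and returns the host of the first one that can serve the block.
    static Optional<String> firstAvailable(List<Replica> replicas) {
        for (Replica r : replicas) {
            if (r.data != null) {
                return Optional.of(r.host);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
                new Replica("dn1", null),              // this copy is unavailable
                new Replica("dn2", new byte[] {42}),
                new Replica("dn3", new byte[] {42}));
        // Prints dn2: the read skips the failed copy and uses the next one.
        System.out.println(firstAvailable(replicas).orElse("no replica available"));
    }
}
```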

6. Reading / writing files using Java API:
This topic is covered in the read path and write path sections above. Hadoop provides Java APIs (such as FileSystem) to make these calls.

7. Latest in HDFS
Apache Hadoop 3.0.0-alpha2 is released

8. Namenode HA and Federation

In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that both nodes have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
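The Active/Standby relationship can be sketched as a toy model in plain Java. This is a simplified illustration under stated assumptions, not the real NameNode code: the shared storage directory is represented by an in-memory list of edits, and every class and method name here is made up.

```java
import java.util.ArrayList;
import java.util.List;

class HaFailoverDemo {
    enum State { ACTIVE, STANDBY }

    static class NameNode {
        State state;
        int applied = 0; // number of shared edits already applied to this node
        final List<String> namespace = new ArrayList<>();

        NameNode(State state) {
            this.state = state;
        }

        // Replays any edits from shared storage that this node has not applied yet.
        void catchUp(List<String> sharedEdits) {
            while (applied < sharedEdits.size()) {
                namespace.add(sharedEdits.get(applied++));
            }
        }
    }

    final List<String> sharedEdits = new ArrayList<>(); // stands in for the shared directory
    NameNode active = new NameNode(State.ACTIVE);
    NameNode standby = new NameNode(State.STANDBY);

    // Client operations go only to the Active node, which logs them to shared storage.
    void create(String path) {
        sharedEdits.add(path);
        active.catchUp(sharedEdits);
    }

    // Failover: the Standby replays outstanding edits, then the two roles swap.
    void failover() {
        standby.catchUp(sharedEdits);
        active.state = State.STANDBY;
        standby.state = State.ACTIVE;
        NameNode old = active;
        active = standby;
        standby = old;
    }

    public static void main(String[] args) {
        HaFailoverDemo cluster = new HaFailoverDemo();
        cluster.create("/user/a.txt");
        cluster.create("/share/b.txt");
        cluster.failover();
        // Prints [/user/a.txt, /share/b.txt]: the new Active node knows both files.
        System.out.println(cluster.active.namespace);
    }
}
```

In this model the Standby only replays edits at failover time. A real Standby keeps reading the shared edits continuously, which is what makes the failover fast.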

In Short HA means
Two Name Nodes: Active and Standby

Classic mode
– One NameNode
– One “helper” node called Secondary Name Node
– Bookkeeping, not backup

Detailed DOC can be found here 

HDFS federation
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.
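The /user and /share example above can be sketched as a small routing table in plain Java. This is a hedged illustration of the partitioning idea, similar in spirit to a client-side mount table, and not the actual federation code. The class name and the namenode names namenode1 and namenode2 are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FederationRouter {
    // Maps a namespace prefix (such as /user) to the namenode that manages it.
    private final Map<String, String> mounts = new LinkedHashMap<>();

    void mount(String prefix, String namenode) {
        mounts.put(prefix, namenode);
    }

    // Returns the namenode whose prefix is the longest match for the path, or null if none matches.
    String resolve(String path) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            String prefix = e.getKey();
            // Match whole path components only, so /username does not match /user.
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && prefix.length() > bestLen) {
                best = e.getValue();
                bestLen = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        FederationRouter router = new FederationRouter();
        router.mount("/user", "namenode1");
        router.mount("/share", "namenode2");
        System.out.println(router.resolve("/user/alice/data.csv")); // prints namenode1
        System.out.println(router.resolve("/share/lib.jar"));       // prints namenode2
    }
}
```

Because each namenode only holds the metadata for its own prefix, adding a namenode for a new prefix grows the namespace capacity without touching the others.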
