HDFS Basic Principles

Hadoop Distributed File System (HDFS) implements reliable and distributed read/write of massive amounts of data. HDFS is applicable to the scenario where data read/write features "write once and read multiple times". However, the write operation is performed in sequence, that is, it is a write operation performed during file creation or an adding operation performed behind the existing file. HDFS ensures that only one caller can perform write operation on a file but multiple callers can perform read operation on the file at the same time.

Architecture

HDFS consists of active and standby NameNodes and multiple DataNodes, as shown in Figure 1.

HDFS works in master/slave architecture. NameNodes run on the master (active) node, and DataNodes run on the slave (standby) node. ZKFC should run along with the NameNodes.

The communication between NameNodes and DataNodes is based on Transmission Control Protocol (TCP)/Internet Protocol (IP). The NameNode, DataNode, ZKFC, and JournalNode can be deployed on Linux servers.

Figure 1 HA HDFS architecture

Table 1 describes the functions of each module shown in Figure 1.

Table 1 Module description

Module

Description

NameNode

A NameNode is used to manage the namespace, directory structure, and metadata information of a file system and provide the backup mechanism. The NameNode is classified into the following two types:

  • Active NameNode: manages the namespace, maintains the directory structure and metadata of file systems, and records the mapping relationships between data blocks and files to which the data blocks belong.
  • Standby NameNode: synchronizes with the data in the active NameNode, and takes over services from the active NameNode when the active NameNode is faulty.
  • Observer NameNode: synchronizes with the data in the active NameNode, and processes read requests from the client.

DataNode

A DataNode is used to store data blocks of each file and periodically report the storage status to the NameNode.

JournalNode

In HA cluster, synchronizes metadata between the active and standby NameNodes.

ZKFC

ZKFC must be deployed for each NameNode. It monitors NameNode status and writes status information to ZooKeeper. ZKFC also has permissions to select the active NameNode.

ZK Cluster

ZooKeeper is a coordination service which helps the ZKFC to elect the active NameNode.

HttpFS gateway

HttpFS is a single stateless gateway process which provides the WebHDFS REST API for external processes and FileSystem API for the HDFS. HttpFS is used for data transmission between different versions of Hadoop. It is also used as a gateway to access the HDFS behind a firewall.

Principle

MRS uses the HDFS copy mechanism to ensure data reliability. One backup file is automatically generated for each file saved in HDFS, that is, two copies are generated in total. The number of HDFS copies can be queried using the dfs.replication parameter.

Figure 3 HDFS architecture

The HDFS component of MRS supports the following features:

For details about the Hadoop architecture and principles, see https://hadoop.apache.org/.