You need to configure the nodes for storing HDFS file data blocks based on data features. You can configure a label expression for an HDFS directory or file and assign one or more labels to each DataNode so that file data blocks are stored on the specified DataNodes.
If the label-based data block placement policy is used to select DataNodes for storing the specified files, the label expression determines the range of eligible DataNodes. Suitable nodes are then selected from that range.
This section applies to MRS 3.x or later.
After cross-AZ HA is enabled for a single cluster, the HDFS NodeLabel function cannot be configured.
When the data of different applications must be kept on different nodes for separate management, label expressions can be used to isolate services, so that the data of each service is stored on its designated nodes.
By configuring the NodeLabel feature, you can perform the following operations:
In a heterogeneous cluster, customers need to allocate certain high-availability nodes to store important commercial data. Label expressions can be used to specify replica locations so that a replica is placed on a highly reliable node.
Data blocks in the /data directory have three replicas by default. In this case, at least one replica is stored on a node of RACK1 or RACK2 (the nodes of RACK1 and RACK2 are highly reliable), and the other two are stored separately on the nodes of RACK3 and RACK4.
Run the hdfs nodelabel -setLabelExpression -expression 'LabelA||LabelB[fallback=NONE],LabelC,LabelD' -path /data command to set an expression for the /data directory.
When data is written to the /data directory, at least one data block replica is stored on a node labeled with LabelA or LabelB, and the other two data block replicas are stored separately on nodes labeled with LabelC and LabelD.
Go to the All Configurations page of HDFS and enter a parameter name in the search box by referring to Modifying Cluster Service Configuration Parameters.
| Parameter | Description | Default Value |
|---|---|---|
| dfs.block.replicator.classname | Used to configure the DataNode policy of HDFS. To enable the NodeLabel function, set this parameter to org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeLabel. | org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy |
| host2tags | Used to configure a mapping between a DataNode host and a label. The host name can be configured with an IP address extension expression (for example, 192.168.1.[1-128] or 192.168.[2-3].[1-128]) or a regular expression (for example, /datanode-[123]/ or /datanode-\d{2}/) starting and ending with a slash (/). The label configuration name cannot contain the following characters: = / \ Note: The IP address must be a service IP address. | - |
Assume that a cluster has 20 DataNodes named dn-1 to dn-20 and that their IP addresses range from 10.1.120.1 to 10.1.120.20. The value of host2tags can be expressed in either of the following ways:
Regular expression of the host name
/dn-\d/ = label-1 indicates that the labels corresponding to dn-1 to dn-9 are label-1, that is, dn-1 = label-1, dn-2 = label-1, ..., dn-9 = label-1.
/dn-((1[0-9]$)|(20$))/ = label-2 indicates that the labels corresponding to dn-10 to dn-20 are label-2, that is, dn-10 = label-2, dn-11 = label-2, ..., dn-20 = label-2.
IP address range expression
10.1.120.[1-9] = label-1 indicates that the labels corresponding to 10.1.120.1 to 10.1.120.9 are label-1, that is, 10.1.120.1 = label-1, 10.1.120.2 = label-1, ..., and 10.1.120.9 = label-1.
10.1.120.[10-20] = label-2 indicates that the labels corresponding to 10.1.120.10 to 10.1.120.20 are label-2, that is, 10.1.120.10 = label-2, 10.1.120.11 = label-2, ..., and 10.1.120.20 = label-2.
A newly added DataNode will be assigned a label if the IP address of the DataNode is within the IP address range in the host2tags configuration item or the host name of the DataNode matches the host name regular expression in the host2tags configuration item.
For example, the value of host2tags is 10.1.120.[1-9] = label-1, but the current cluster has only three DataNodes: 10.1.120.1 to 10.1.120.3. If DataNode 10.1.120.4 is added for capacity expansion, it is labeled as label-1. If DataNode 10.1.120.3 is deleted or out of service, no data block will be allocated to that node.
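The matching described above can be illustrated with a minimal Python sketch. It is not the HDFS implementation: the function names `ip_matches` and `labels_for` are hypothetical, and the sketch assumes that a host name regular expression must match the whole host name, which is what the /dn-\d/ example (dn-1 to dn-9 only) suggests.

```python
import re

def ip_matches(pattern, ip):
    """Check an IP against a range expression such as 10.1.120.[1-9]."""
    pattern_parts, ip_parts = pattern.split("."), ip.split(".")
    if len(pattern_parts) != 4 or len(ip_parts) != 4:
        return False
    for part, octet in zip(pattern_parts, ip_parts):
        bounds = re.fullmatch(r"\[(\d+)-(\d+)\]", part)
        if bounds:
            low, high = int(bounds.group(1)), int(bounds.group(2))
            if not low <= int(octet) <= high:
                return False
        elif part != octet:
            return False
    return True

def labels_for(host, ip, host2tags):
    """Collect the labels of one DataNode from (key, value) host2tags entries."""
    labels = set()
    for key, value in host2tags:
        if key.startswith("/") and key.endswith("/"):
            # Assumption: the regular expression must match the full host name.
            matched = re.fullmatch(key[1:-1], host) is not None
        else:
            matched = ip_matches(key, ip)
        if matched:
            labels.update(item.strip() for item in value.split(","))
    return labels

by_name = [(r"/dn-\d/", "label-1"), (r"/dn-((1[0-9]$)|(20$))/", "label-2")]
by_ip = [("10.1.120.[1-9]", "label-1"), ("10.1.120.[10-20]", "label-2")]

print(labels_for("dn-4", "10.1.120.4", by_ip))      # newly added node in range
print(labels_for("dn-15", "10.1.120.15", by_name))
```

Because the labels are derived from the expressions rather than from a fixed node list, a node added later is labeled as soon as its IP address or host name falls within an existing entry.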
NodeLabel supports different placement policies for replicas. The expression label-1,label-2,label-3 indicates that three replicas are placed in DataNodes containing label-1, label-2, and label-3, respectively. Different replica policies are separated by commas (,).
If you want to place two replicas in DataNodes with label-1, set the expression as follows: label-1[replica=2],label-2,label-3. In this case, if the default number of replicas is 3, two nodes with label-1 and one node with label-2 are selected. If the default number of replicas is 4, two nodes with label-1, one node with label-2, and one node with label-3 are selected. Replicas are assigned to the replica policies from left to right, and each policy receives the number of replicas it specifies. If the number of replicas exceeds the total specified by the expression, the extra replicas are placed according to the last policy. For example, if the default number of replicas is 5, the extra replica is placed on a node labeled with label-3.
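The left-to-right assignment can be sketched in Python as follows. This is only an illustration of the counting rule described above, not HDFS code; `parse_policies` and `allocate` are hypothetical names, and options other than `replica` are ignored.

```python
import re

def parse_policies(expression):
    """Split 'label-1[replica=2],label-2' into [(label, replica_count), ...]."""
    policies = []
    # Split on commas that are not inside a [...] option list.
    for part in re.split(r",(?![^\[]*\])", expression):
        part = part.strip()
        label, replica = part, 1
        options = re.search(r"\[([^\]]*)\]$", part)
        if options:
            label = part[:options.start()].strip()
            for option in options.group(1).split(","):
                key, _, value = option.partition("=")
                if key.strip() == "replica":
                    replica = int(value)
        policies.append((label, replica))
    return policies

def allocate(expression, total_replicas):
    """Assign replicas to policies from left to right; extras go to the last policy."""
    policies = parse_policies(expression)
    counts, remaining = [], total_replicas
    for _, wanted in policies:
        taken = min(wanted, remaining)
        counts.append(taken)
        remaining -= taken
    counts[-1] += remaining
    return [(label, count) for (label, _), count in zip(policies, counts)]

for total in (3, 4, 5):
    print(total, allocate("label-1[replica=2],label-2,label-3", total))
```

With 3, 4, and 5 replicas, the sketch reproduces the three cases described in the text.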
When the ACLs function is enabled and the user does not have the permission to access the labels used in the expression, the DataNode with the label is not selected for the replica.
If the number of block replicas exceeds the value of dfs.replication (the number of file replicas specified by the user), HDFS deletes the redundant block replicas to avoid wasting cluster resources.
The deletion rules are as follows:
Replicas on nodes whose labels do not match any policy in the expression are deleted first. For example, the default number of file replicas is 3.
The label expression of /test is LA[replica=1],LB[replica=1],LC[replica=1].
The file replicas of /test are distributed on four nodes (D1 to D4), corresponding to labels (LA to LD).
D1:LA D2:LB D3:LC D4:LD
Then, block replicas on node D4 will be deleted.
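If all replicas match the expression, the replicas that exceed the number specified by a policy are deleted. For example, the default number of file replicas is 3.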
The label expression of /test is LA[replica=1],LB[replica=1],LC[replica=1].
The file replicas of /test are distributed on the following four nodes, corresponding to the following labels.
D1:LA D2:LA D3:LB D4:LC
Then, block replicas on node D1 or D2 will be deleted.
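The two examples above can be reproduced with the following Python sketch. The selection order is an assumption inferred from those examples, not a description of the actual HDFS logic, and `redundant_replicas` is a hypothetical name. Where the text allows either of two nodes (D1 or D2), the sketch simply picks the later one.

```python
def redundant_replicas(policies, replicas, target):
    """Pick replicas to delete.

    policies: list of (label, replica_count) taken from the label expression.
    replicas: list of (node, set_of_labels) currently holding the block.
    target:   desired number of replicas.
    """
    excess = len(replicas) - target
    if excess <= 0:
        return []
    wanted = {label for label, _ in policies}
    quota = dict(policies)

    # Assumed step 1: replicas on nodes that match no policy go first.
    unmatched = [node for node, labels in replicas if not labels & wanted]
    to_delete = unmatched[:excess]

    # Assumed step 2: replicas beyond a policy's replica count go next.
    used = {label: 0 for label in wanted}
    over_quota = []
    for node, labels in replicas:
        if node in unmatched:
            continue
        label = next(name for name, _ in policies if name in labels)
        used[label] += 1
        if used[label] > quota[label]:
            over_quota.append(node)
    to_delete += over_quota[:excess - len(to_delete)]
    return to_delete

policies = [("LA", 1), ("LB", 1), ("LC", 1)]
first = [("D1", {"LA"}), ("D2", {"LB"}), ("D3", {"LC"}), ("D4", {"LD"})]
second = [("D1", {"LA"}), ("D2", {"LA"}), ("D3", {"LB"}), ("D4", {"LC"})]
print(redundant_replicas(policies, first, 3))
print(redundant_replicas(policies, second, 3))
```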
Assume that there are six DataNodes, namely, dn-1, dn-2, dn-3, dn-4, dn-5, and dn-6 in a cluster and the corresponding IP address range is 10.1.120.[1-6]. Six directories must be configured with label expressions. The default number of block replicas is 3.
host2tags configured with host name regular expressions:
/dn-[1456]/ = label-1,label-2
/dn-[26]/ = label-1,label-3
/dn-[3456]/ = label-1,label-4
/dn-5/ = label-5
The same mapping configured with IP address range expressions:
10.1.120.[1-6] = label-1
10.1.120.1 = label-2
10.1.120.2 = label-3
10.1.120.[3-6] = label-4
10.1.120.[4-6] = label-2
10.1.120.5 = label-5
10.1.120.6 = label-3
Resulting labels of each DataNode:
/dn-1/ = label-1, label-2
/dn-2/ = label-1, label-3
/dn-3/ = label-1, label-4
/dn-4/ = label-1, label-2, label-4
/dn-5/ = label-1, label-2, label-4, label-5
/dn-6/ = label-1, label-2, label-3, label-4
Label expressions of the directories and file:
/dir1 = label-1
/dir2 = label-1 && label-3
/dir3 = label-2 || label-4[replica=2]
/dir4 = (label-2 || label-3) && label-4
/dir5 = !label-1
/sdir2.txt = label-1 && label-3[replica=3,fallback=NONE]
/dir6 = label-4[replica=2],label-2
For details about the label expression configuration, see the hdfs nodelabel -setLabelExpression command.
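To see which DataNodes satisfy the boolean part of an expression, the following Python sketch evaluates `&&`, `||`, `!`, and parentheses against the per-DataNode labels listed above. It is a hypothetical illustration, not the HDFS parser: it assumes that `&&` binds more tightly than `||`, it drops bracketed options such as `[replica=2]` and `[fallback=NONE]`, and it does not handle comma-separated replica policies. The actual block placement also depends on those options, so the output is only the set of candidate nodes.

```python
import re

NODE_LABELS = {
    "dn-1": {"label-1", "label-2"},
    "dn-2": {"label-1", "label-3"},
    "dn-3": {"label-1", "label-4"},
    "dn-4": {"label-1", "label-2", "label-4"},
    "dn-5": {"label-1", "label-2", "label-4", "label-5"},
    "dn-6": {"label-1", "label-2", "label-3", "label-4"},
}

TOKEN = re.compile(r"\s*(&&|\|\||!|\(|\)|[\w-]+)")

def tokenize(expression):
    # Drop bracketed options; this sketch only evaluates the boolean part.
    text = re.sub(r"\[[^\]]*\]", "", expression).strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def matches(expression, labels):
    """Return True if a node carrying `labels` satisfies the expression."""
    tokens, pos = tokenize(expression), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "||":
            take()
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        value = parse_unary()
        while peek() == "&&":
            take()
            right = parse_unary()
            value = value and right
        return value

    def parse_unary():
        token = take()
        if token == "!":
            return not parse_unary()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        return token in labels

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: %s" % tokens[pos])
    return result

def candidates(expression):
    return sorted(n for n, labels in NODE_LABELS.items() if matches(expression, labels))

print(candidates("label-1 && label-3"))
print(candidates("(label-2 || label-3) && label-4"))
print(candidates("!label-1"))
```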
The file data block storage locations are as follows:
In configuration files, a key and its value are separated by an equal sign (=), a colon (:), or whitespace. Therefore, the host name used as the key cannot contain these characters, because they may be interpreted as separators.