In an HDFS cluster, disk usage can become unbalanced across DataNodes, for example, when new DataNodes are added to the cluster. Unbalanced disk usage can cause several problems: MapReduce applications cannot take full advantage of local computing, network bandwidth between DataNodes is not used optimally, or node disks cannot be used. Therefore, the system administrator needs to periodically check and maintain DataNode data balance.
HDFS provides a capacity balancing program named Balancer. By running Balancer, you can balance the HDFS cluster and ensure that the difference between the disk usage of each DataNode and the average disk usage of the HDFS cluster does not exceed the threshold. DataNode disk usage before and after balancing is shown in Figure 1 and Figure 2, respectively.
The time of the balancing operation is affected by the following two factors:
The data volume of each DataNode must be greater than (Average usage - Threshold) x Average data volume and less than (Average usage + Threshold) x Average data volume. If the actual data volume is less than the minimum value or greater than the maximum value, imbalance occurs. The system sets the largest deviation volume on all DataNodes as the total data volume to be migrated.
Therefore, for a given cluster, you can estimate the task execution time as follows: observe the time consumed by each iteration recorded in the balancer logs, divide the total data volume to be migrated by 10 GB to obtain the approximate number of iterations, and multiply the two values.
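The estimate described above can be sketched as a small shell function. This is only a rough illustration: the function name `estimate_balancer_minutes` and its two integer inputs (total data volume in GB and observed minutes per iteration) are hypothetical and are not part of any HDFS tool.

```shell
#!/usr/bin/env bash
# Rough estimate of Balancer run time, following the rule of thumb in the text.
# Hypothetical inputs: total data to migrate (GB) and the per-iteration time
# (minutes) observed in the balancer logs, both as integers.

estimate_balancer_minutes() {
  local total_gb=$1          # total data volume to be migrated, in GB
  local minutes_per_iter=$2  # time per iteration taken from the balancer logs
  local gb_per_iter=10       # divisor from the text (10 GB)
  # Round up: a partial iteration still counts as one iteration.
  local iterations=$(( (total_gb + gb_per_iter - 1) / gb_per_iter ))
  echo $(( iterations * minutes_per_iter ))
}

# Example: 250 GB to migrate, about 4 minutes per iteration.
estimate_balancer_minutes 250 4   # prints 100
```

The result is an order-of-magnitude figure only, because the time per iteration depends on the bandwidth limit and on cluster load.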
The balancer can be started or stopped at any time.
The client has been installed.
cd /opt/client
If the cluster is in normal mode, run the su - omm command to switch to user omm.
source bigdata_env
kinit hdfs
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
<bandwidth in bytes per second> indicates the bandwidth control value, in bytes per second. For example, to set the bandwidth control to 20 MB/s (20 x 1024 x 1024 = 20971520 bytes per second), run the following command:
hdfs dfsadmin -setBalancerBandwidth 20971520
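To avoid calculating the byte value by hand, the conversion can be scripted. The following is a minimal sketch; the helper name `mb_per_sec_to_bytes` is hypothetical, and the script only prints the resulting command rather than running it.

```shell
#!/usr/bin/env bash
# Convert a bandwidth limit in MB/s to bytes per second (1 MB = 1024 * 1024 bytes).
# mb_per_sec_to_bytes is a hypothetical helper, not an HDFS command.

mb_per_sec_to_bytes() {
  local mb=$1
  echo $(( mb * 1024 * 1024 ))
}

bandwidth=$(mb_per_sec_to_bytes 20)
# Print the command instead of executing it, so the sketch runs without a cluster.
echo "hdfs dfsadmin -setBalancerBandwidth ${bandwidth}"
```

For 20 MB/s, the printed command matches the example above.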
bash /opt/client/HDFS/hadoop/sbin/start-balancer.sh -threshold <threshold of balancer>
-threshold specifies the deviation value of the DataNode disk usage, which is used for determining whether the HDFS data is balanced. When the difference between the disk usage of each DataNode and the average disk usage of the entire HDFS cluster is less than this threshold, the system considers that the HDFS cluster has been balanced and ends the balance task.
For example, to set the deviation rate to 5%, run the following command:
bash /opt/client/HDFS/hadoop/sbin/start-balancer.sh -threshold 5
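The threshold check can be illustrated with a short script. This is a simplified sketch, not the Balancer's actual implementation: the function `is_balanced` is hypothetical, and it assumes that all DataNodes have the same capacity, so the cluster average usage equals the plain mean of the per-node usage percentages.

```shell
#!/usr/bin/env bash
# Usage: is_balanced THRESHOLD USAGE1 USAGE2 ...
# Prints "balanced" if no DataNode's usage (in percent) differs from the
# average usage by more than THRESHOLD, otherwise prints "unbalanced".
# Assumption: equal-capacity DataNodes, so the average is a simple mean.

is_balanced() {
  local threshold=$1; shift
  printf '%s\n' "$@" | awk -v t="$threshold" '
    { u[NR] = $1; sum += $1 }
    END {
      avg = sum / NR
      for (i = 1; i <= NR; i++) {
        d = u[i] - avg
        if (d < 0) d = -d               # absolute deviation from the average
        if (d > t) { print "unbalanced"; exit }
      }
      print "balanced"
    }'
}

is_balanced 5 40 42 45   # every node is within 5 points of the average
is_balanced 5 20 50 80   # 20% and 80% are 30 points from the 50% average
```

The first call prints `balanced` and the second prints `unbalanced`. A smaller threshold produces a more even distribution but means more data has to be moved.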
hdfs dfs -rm -f /system/balancer.id
After you run the start-balancer.sh script, the hadoop-root-balancer-Host name.out log file is generated in the client installation directory /opt/client/HDFS/hadoop/logs. You can view information such as the following in the log:
Apr 01, 2016 01:01:01 PM Balancing took 23.3333 minutes
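To check from a script whether a balancing run has completed, you can search the .out log file for the `Balancing took` line. The following sketch assumes that the completion message has the same format as the sample log line in this section; the function name `balancer_finished` is hypothetical, and a temporary file stands in for the real log file.

```shell
#!/usr/bin/env bash
# Usage: balancer_finished LOGFILE
# Prints the last "Balancing took" line and returns 0 if one exists;
# returns non-zero if the log contains no such line.

balancer_finished() {
  local log_file=$1
  local line
  line=$(grep 'Balancing took' "$log_file" | tail -n 1)
  [ -n "$line" ] && echo "$line"
}

# Demonstration with a temporary file in place of the real .out log.
sample=$(mktemp)
echo 'Apr 01, 2016 01:01:01 PM Balancing took 23.3333 minutes' > "$sample"
if balancer_finished "$sample"; then
  echo "balancer run complete"
fi
rm -f "$sample"
```

In practice, pass the path of the .out file under /opt/client/HDFS/hadoop/logs instead of the temporary file.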
Enable automatic execution of the balance task
Table 1 describes the expression for modifying this parameter. * indicates consecutive time segments.
Parameter | Parameter description | Default Value |
---|---|---|
dfs.balancer.auto.threshold | Specifies the balancing threshold of the disk capacity percentage. This parameter is valid only when dfs.balancer.auto.enable is set to true. | 10 |
dfs.balancer.auto.exclude.datanodes | Specifies the list of DataNodes on which automatic disk balancing is not required. This parameter is valid only when dfs.balancer.auto.enable is set to true. | The value is left blank by default. |
dfs.balancer.auto.bandwidthPerSec | Specifies the maximum bandwidth (MB/s) of each DataNode for load balancing. | 20 |
dfs.balancer.auto.maxIdleIterations | Specifies the maximum number of consecutive idle iterations of Balancer. An idle iteration is an iteration without moving blocks. When the number of consecutive idle iterations reaches the maximum number, the balance task ends. The value -1 indicates infinity. | 5 |
dfs.balancer.auto.maxDataNodesNum | Controls the number of DataNodes that perform automatic balance tasks. Assume that the value of this parameter is N. If N is greater than 0, data is balanced between N DataNodes with the highest percentage of remaining space and N DataNodes with the lowest percentage of remaining space. If N is 0, data is balanced among all DataNodes in the cluster. | 5 |
To view the task execution logs, open the /var/log/Bigdata/hdfs/nn/hadoop-omm-balancer-Host name.log file, which is saved on the active NameNode.