The performance optimization effect is verified by comparing actual values with the baseline data. Therefore, determining the optimal job baseline is critical to performance optimization.
When determining the job baseline, comply with the following rules:
Keep all nodes as busy as possible while a job is running. Maximizing the number of concurrent tasks helps make full use of cluster resources. You can achieve this by adjusting the data volume to be processed and the number of Map and Reduce tasks.
You can set mapreduce.job.reduces to control the number of Reduce tasks.
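The parameter can also be overridden per job on the command line. A minimal sketch, assuming a generic WordCount job (the jar name, main class, and paths are hypothetical; the `-D` syntax is the standard Hadoop GenericOptionsParser mechanism):

```shell
# Override the Reduce task count for a single job run.
# -D options must appear before the job's own arguments.
hadoop jar wordcount.jar WordCount \
  -D mapreduce.job.reduces=32 \
  /input /output
```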
The number of Map tasks depends on the InputFormat type and on whether the data files to be processed are splittable. By default, TextInputFormat allocates Map tasks based on the number of blocks, that is, one Map task per block. You can adjust the following parameters to improve resource utilization.
Parameter portal:
On the All Configurations page of the Yarn service, enter a parameter name in the search box. For details, see Modifying Cluster Service Configuration Parameters.
Parameter | Description | Default Value
---|---|---
mapreduce.input.fileinputformat.split.maxsize | Indicates the maximum size of a data split fed to a Map task. The split size is calculated from the user-configured values and the block size of each file, using the formula: splitSize = Math.max(minSize, Math.min(maxSize, blockSize)). If maxSize is greater than blockSize, each block is one split. If maxSize is smaller than blockSize, a block is divided into multiple splits. If the remaining data in a block is smaller than splitSize, that remaining data is treated as a separate split. | -
mapreduce.input.fileinputformat.split.minsize | Indicates the minimum size of a data split. | 0
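The split-size formula above can be sketched as a small standalone program (a hypothetical re-implementation for illustration; in Hadoop itself this logic lives inside FileInputFormat). It shows how lowering split.maxsize below the block size increases the number of Map tasks:

```java
// Standalone sketch of splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
public class SplitSizeDemo {
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of Map tasks for a splittable file of the given length
    // (the last split holds whatever data remains).
    static long numSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;   // 128 MB HDFS block (example value)
        long minSize   = 0;            // split.minsize default
        long maxSize   = 64L << 20;    // split.maxsize lowered to 64 MB

        long splitSize = computeSplitSize(minSize, maxSize, blockSize);
        System.out.println(splitSize >> 20);                  // split size in MB
        System.out.println(numSplits(1024L << 20, splitSize)); // Map tasks for a 1 GB file
    }
}
```

Because maxSize (64 MB) is smaller than blockSize (128 MB), each block is divided into two splits, so a 1 GB file produces 16 Map tasks instead of 8.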
If each Map or Reduce task of a job takes only a few seconds, most of the job's time is spent scheduling tasks and starting and stopping processes. Therefore, you need to increase the data volume processed by each task. The preferred processing time for each task is about 1 minute.
You can adjust the same parameters described in the table above, mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize, to control the amount of data processed by each task.
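Note that because splitSize is capped at min(maxSize, blockSize), raising split.maxsize alone never produces splits larger than one block; to enlarge splits toward the 1-minute target, raise split.minsize above the block size. A minimal sketch, assuming a hypothetical Map throughput of 5 MB/s:

```java
// Sketch of sizing splits for ~1-minute Map tasks (illustrative values only).
public class TaskSizingDemo {
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize     = 128L << 20;  // 128 MB HDFS block (example value)
        long mapThroughput = 5L << 20;    // assumed: a Map task processes ~5 MB/s
        long targetSeconds = 60;          // preferred task duration: 1 minute
        long desired = mapThroughput * targetSeconds;  // ~300 MB per split

        // With the defaults, a split never exceeds one block:
        long byDefault = computeSplitSize(0, Long.MAX_VALUE, blockSize);
        System.out.println(byDefault >> 20);  // split size in MB

        // Raising split.minsize above the block size enlarges each split:
        long tuned = computeSplitSize(desired, Long.MAX_VALUE, blockSize);
        System.out.println(tuned >> 20);      // split size in MB
    }
}
```

The throughput figure is an assumption for illustration; measure your own jobs' task durations and scale the split size accordingly.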