A cluster typically runs multiple services, and most of them store their data in HDFS. While the cluster is running, components such as Spark and Yarn, as well as clients, may continuously write files to the same HDFS directory. However, HDFS limits the number of items that a single directory can contain. Plan your directory layout in advance so that no single directory exceeds this limit; otherwise, tasks that write to that directory may fail.
You can set the maximum number of items allowed in a single directory using the dfs.namenode.fs-limits.max-directory-items parameter in HDFS.
Parameter | Description | Default Value |
---|---|---|
dfs.namenode.fs-limits.max-directory-items | Maximum number of items in a directory. Value range: 1 to 6,400,000 | 1048576 |
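
To see how close a directory is to this limit, you can count its direct children. The following is a minimal sketch, not an official tool: it assumes the `hdfs` command is on the PATH and that `hdfs dfs -ls <dir>` prints a `Found N items` summary line followed by one line per entry. The function names are hypothetical. Listing a very large directory can be slow, so run such a check sparingly.

```python
import subprocess

# Default value of dfs.namenode.fs-limits.max-directory-items (from the table above).
MAX_DIRECTORY_ITEMS = 1_048_576


def list_directory(path: str) -> str:
    """Return the raw text printed by `hdfs dfs -ls <path>` (assumes the hdfs CLI is installed)."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def count_direct_children(ls_output: str) -> int:
    """Count entry lines in `hdfs dfs -ls` output, skipping blank lines and the summary line."""
    lines = [line for line in ls_output.splitlines() if line.strip()]
    return sum(1 for line in lines if not line.startswith("Found "))


def usage_ratio(item_count: int, limit: int = MAX_DIRECTORY_ITEMS) -> float:
    """Fraction of the per-directory item limit that is already used."""
    return item_count / limit
```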
Plan data storage in advance, for example by organizing directories by time period and service type, so that files do not accumulate in a single directory. You are advised to keep the default value, which allows about 1 million items in a single directory.
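
One way to apply this kind of planning is to derive the output directory from the service name and the write time, so that each service and each day (or hour) gets its own subdirectory. The sketch below is only an illustration: the base path `/data`, the layout `service/year/month/day[/hour]`, and the function name `partitioned_dir` are assumptions, not an HDFS convention.

```python
import posixpath
from datetime import datetime


def partitioned_dir(base: str, service: str, when: datetime, hourly: bool = False) -> str:
    """Build an HDFS directory path partitioned by service type and time.

    The layout (base/service/YYYY/MM/DD[/HH]) is a hypothetical example.
    """
    parts = [
        base,
        service,
        when.strftime("%Y"),
        when.strftime("%m"),
        when.strftime("%d"),
    ]
    if hourly:
        # Add an hour level for writers that produce many files per day.
        parts.append(when.strftime("%H"))
    # HDFS paths always use forward slashes, so use posixpath rather than os.path.
    return posixpath.join(*parts)


if __name__ == "__main__":
    print(partitioned_dir("/data", "spark", datetime(2024, 3, 5, 14)))
    print(partitioned_dir("/data", "spark", datetime(2024, 3, 5, 14), hourly=True))
```

With this layout, each directory only receives the files written by one service in one time window, which keeps the item count in any single directory well below the limit as long as the window is chosen to match the write rate.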