Common Issues Related to Insufficient Disk Space and Solutions

This section describes common issues caused by insufficient disk space and their solutions.

Symptom

When data, code, or a model is copied during training, the error message "No space left on device" is displayed.

Figure 1 Error log

Possible Causes

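The possible causes are as follows:
  • The dataset, the decompressed dataset, or the checkpoint files have used up the disk space.
  • The volume of data exceeds the /cache size.
  • Data or checkpoints are saved outside /cache and /home/ma-user/.
  • Historical checkpoints are not deleted and accumulate in /cache.
  • The number of files exceeds 500,000, which may cause an error in the file index cache of the operating system even though disk space remains.
  • Core files are generated and occupy disk space.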

Solution

  1. Obtain the sizes of the dataset, the decompressed dataset, and the checkpoint files, and check whether together they have exhausted the disk space.
  2. If the volume of data exceeds the /cache size, use SFS to attach more data disks and expand the storage capacity.
  3. Save the data and checkpoint in /cache or /home/ma-user/.
  4. Check the checkpoint logic and ensure that historical checkpoints are deleted so that they will not use up the storage space in /cache.
  5. If the total file size is smaller than the /cache size but the number of files exceeds 500,000, the issue may be caused by an error in the file index cache of the operating system. In this case, do as follows:
    • Reduce the number of files in a single directory.
    • Slow down file creation. For example, during data decompression, sleep for 5 seconds before decompressing the next piece of data.
  6. If the issue is caused by core files, add the following code at the very beginning of the boot script to disable core file generation (debug the code in a development environment before adding it):
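    import resource
    # Set the core file size limit of this process (and its children) to 0.
    # os.system("ulimit -c 0") would only affect a temporary subshell.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))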
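
The checks in steps 1, 4, and 5 can be scripted. The following is a minimal sketch, not an official ModelArts utility: the function names (scan_dir, fits_on_disk, prune_checkpoints), the "*.ckpt" file pattern, and the default of keeping three checkpoints are all assumptions made for illustration, so adjust them to match how your training code names and saves checkpoints.

```python
import os
import shutil
from pathlib import Path


def scan_dir(root):
    """Return (total_bytes, file_count) for all regular files under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so that linked data is not counted twice.
            if os.path.islink(path):
                continue
            total_bytes += os.path.getsize(path)
            file_count += 1
    return total_bytes, file_count


def fits_on_disk(data_dir, disk_path="/cache", reserve_bytes=0):
    """Check whether the files in data_dir fit into the free space of disk_path.

    reserve_bytes is extra headroom (hypothetical parameter), for example
    for the decompressed dataset or for upcoming checkpoints.
    """
    total_bytes, _file_count = scan_dir(data_dir)
    free_bytes = shutil.disk_usage(disk_path).free
    return total_bytes + reserve_bytes <= free_bytes


def prune_checkpoints(ckpt_dir, keep_last=3, pattern="*.ckpt"):
    """Delete all but the keep_last most recently modified checkpoints.

    Returns the names of the deleted files. The "*.ckpt" pattern is an
    assumption; use the extension your framework actually writes.
    """
    ckpts = sorted(Path(ckpt_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    stale = ckpts[:-keep_last] if keep_last > 0 else ckpts
    for path in stale:
        path.unlink()
    return [path.name for path in stale]
```

A training script could call fits_on_disk before copying or decompressing data, compare the file count returned by scan_dir against the 500,000 threshold from step 5, and call prune_checkpoints after each checkpoint is saved so that old checkpoints do not accumulate in /cache.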

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.