Training Fault Tolerance Check

A training job can fail because of a hardware fault. To handle such faults, ModelArts provides a fault tolerance check that isolates faulty nodes so that training can continue on healthy hardware.

The fault tolerance check consists of an environment pre-check and a periodic hardware check. If either check detects a fault, ModelArts automatically isolates the faulty hardware and reissues the training job. In distributed training, the fault tolerance check is performed on all compute nodes used by the training job.

The following shows four failure scenarios, among which the failure in scenario 4 is not caused by a hardware fault. You can enable fault tolerance in the other three scenarios to automatically resume the training job.

After the faulty node is isolated, ModelArts creates the training job again on new compute nodes. If the resource pool does not have enough resources, the reissued training job is queued with the highest priority. If the job waits in the queue for more than 30 minutes, it exits automatically, which indicates that the pool does not have enough resources to start the job. In this case, buy a dedicated resource pool to obtain dedicated resources.

If you use a dedicated resource pool to create a training job, the faulty nodes identified during the fault tolerance check will be removed. The system automatically adds healthy compute nodes to the dedicated resource pool. (This function is coming soon.)

More details of a fault tolerance check:

  1. Enabling Fault Tolerance Check
  2. Check Items and Conditions
  3. Effect of a Fault Tolerance Check
  4. Using reload ckpt to Resume an Interrupted Training: A hardware fault that occurs after the environment pre-check has passed still interrupts the training job. Add reload ckpt logic to the training code so that the job can load the checkpoint saved before the interruption. For details, see Resumable Training and Incremental Training.

Enabling Fault Tolerance Check

To enable fault tolerance check, enable auto restart when creating a training job.

Check Items and Conditions

Check Item | Item (Log Keyword) | Execution Condition | Requirements for a Check
-----------|--------------------|---------------------|-------------------------
Domain name detection | dns | None | The domain names of the volcano containers in the .host file in /etc/volcano are successfully resolved.
Disk size - Container root directory | disk-size root | None | The directory is greater than 32 GB.
Disk size - /dev/shm | disk-size shm | None | The directory is greater than 1 GB.
Disk size - /cache | disk-size cache | None | The directory is greater than 32 GB.
ulimit check | ulimit | An IB network is used. | Maximum locked memory > 16000; Open files > 1000000; Stack size > 8000; Maximum user processes > 1000000
GPU check | gpu-check | GPU and the v2 training engine are used. | GPUs are detected.
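
To make the disk-size and ulimit requirements concrete, the following is a minimal Python sketch of comparable checks that you could run inside a training container. It is not the ModelArts implementation. It assumes that "greater than N GB" refers to the total size of the file system that holds the directory, and that the locked-memory and stack-size thresholds use the KB units printed by ulimit -a; neither is confirmed by the table. The function names (exceeds, check_disk, check_ulimit, run_precheck) are hypothetical, and the result keys only reuse the log keywords from the table as labels.

```python
import resource
import shutil

GIB = 1024 ** 3

# Disk thresholds from the table: label -> (path, minimum size in bytes).
# Treating the requirement as total file-system size is an assumption.
DISK_CHECKS = {
    "disk-size root": ("/", 32 * GIB),
    "disk-size shm": ("/dev/shm", 1 * GIB),
    "disk-size cache": ("/cache", 32 * GIB),
}

# ulimit thresholds from the table: name -> (resource id, threshold, bytes per unit).
# KB units for locked memory and stack size are an assumption.
ULIMIT_CHECKS = {
    "max locked memory": (resource.RLIMIT_MEMLOCK, 16000, 1024),
    "open files": (resource.RLIMIT_NOFILE, 1000000, 1),
    "stack size": (resource.RLIMIT_STACK, 8000, 1024),
    "max user processes": (resource.RLIMIT_NPROC, 1000000, 1),
}


def exceeds(value, threshold, unlimited=resource.RLIM_INFINITY):
    """Return True if a limit is unlimited or strictly greater than the threshold."""
    return value == unlimited or value > threshold


def check_disk(path, min_bytes):
    """Return True if the file system holding path is larger than min_bytes."""
    try:
        return shutil.disk_usage(path).total > min_bytes
    except OSError:
        # A missing or unreadable directory counts as a failed check.
        return False


def check_ulimit(res_id, threshold, unit):
    """Compare the soft limit of the current process with the threshold."""
    soft, _hard = resource.getrlimit(res_id)
    return exceeds(soft, threshold * unit)


def run_precheck():
    """Run all checks and return a dict that maps a label to pass/fail."""
    results = {}
    for label, (path, min_bytes) in DISK_CHECKS.items():
        results[label] = check_disk(path, min_bytes)
    for name, (res_id, threshold, unit) in ULIMIT_CHECKS.items():
        results["ulimit " + name] = check_ulimit(res_id, threshold, unit)
    return results


if __name__ == "__main__":
    for item, ok in run_precheck().items():
        print(f"{item}: {'pass' if ok else 'FAIL'}")
```

The sketch checks the soft limits of the current process and ignores the execution conditions in the table, so it reports the ulimit items even when no IB network is used.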

Effect of a Fault Tolerance Check

Using reload ckpt to Resume an Interrupted Training

With fault tolerance enabled, if a training job is restarted due to a hardware fault, the training code can load the model checkpoint saved before the restart and restore the training to its state at that point. To do so, add reload ckpt logic to the code. For details, see Resumable Training and Incremental Training.
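
The reload ckpt pattern can be sketched as follows. This is a framework-agnostic illustration, not code from ModelArts: the file name checkpoint.json, the functions save_checkpoint, load_checkpoint, and train, and the fail_at parameter (used only to simulate a hardware fault) are all hypothetical. It also assumes that the checkpoint directory is on storage that survives the restart. In a real job, the save and load steps would typically use the checkpoint functions of your training framework.

```python
import json
import os
import tempfile

CKPT_NAME = "checkpoint.json"  # hypothetical file name


def save_checkpoint(ckpt_dir, state):
    """Write the state to a temp file, then rename it, so that an
    interruption during the write cannot corrupt the last good checkpoint."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(ckpt_dir, CKPT_NAME))


def load_checkpoint(ckpt_dir):
    """Return the saved state, or None if no checkpoint exists yet."""
    path = os.path.join(ckpt_dir, CKPT_NAME)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(ckpt_dir, total_epochs, fail_at=None):
    """Train from the last checkpoint if one exists, otherwise from scratch."""
    state = load_checkpoint(ckpt_dir) or {"epoch": 0, "weight": 0.0}
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["weight"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(ckpt_dir, state)  # checkpoint after every epoch
    return state
```

If the first run is interrupted at epoch 3 and the job is then restarted with the same checkpoint directory, the second call to train resumes from epoch 3 instead of epoch 0. How often to save a checkpoint is a trade-off between the I/O cost of saving and the amount of work lost after a fault.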