Suspension Before Training

If a multi-node training job is suspended before training starts, add os.environ["NCCL_DEBUG"] = "INFO" to the training code to print NCCL debugging information in the logs.
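The variable takes effect only if it is set before the process initializes its communication backend. A minimal sketch (placement relative to framework imports is an assumption; adapt it to your training script):

```python
import os

# Enable verbose NCCL logging. This must run before the framework
# (e.g. torch.distributed) initializes NCCL, so put it at the very
# top of the training script.
os.environ["NCCL_DEBUG"] = "INFO"
```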

Symptom 1

The job is suspended and no NCCL debugging information appears in the logs.

Solution 1

Check the code for distributed initialization parameters such as master_ip and rank, and ensure that they are correctly specified.
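A quick way to fail fast instead of hanging is to verify the settings before rendezvous. The sketch below uses the common PyTorch-style variable names (MASTER_ADDR, RANK, etc.), which are an assumption; substitute your framework's equivalents:

```python
import os

# Distributed-init settings that must be present on every node.
# These names follow the common PyTorch conventions (an assumption);
# replace them with your framework's equivalents.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def missing_dist_settings(env=os.environ):
    """Return the required settings that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# In real training code, raise an error here rather than letting the
# job hang silently during rendezvous:
# if missing_dist_settings():
#     raise RuntimeError(f"Missing settings: {missing_dist_settings()}")
```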

Symptom 2

GDR (GPUDirect RDMA) information is displayed only for some nodes of the multi-node training job.

The likely cause of the suspension is GDR.

Solution 2

Set os.environ["NCCL_NET_GDR_LEVEL"] = '0' at the beginning of the program to disable GDR, or ask the O&M personnel to enable GDR on the affected nodes.
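As with NCCL_DEBUG, the variable must be set before NCCL initializes. A minimal sketch:

```python
import os

# Disable GPUDirect RDMA so NCCL falls back to ordinary network
# transfers on every node. Must run before NCCL initializes, so
# place it at the top of the training script.
os.environ["NCCL_NET_GDR_LEVEL"] = "0"
```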

Symptom 3

Communication errors such as "Got completion with error 12, opcode 1, len 32478, vendor err 129" are displayed in the logs, indicating that the network is unstable.

Solution 3

Add environment variables that make NCCL more tolerant of transient network errors, such as the InfiniBand communication timeout (NCCL_IB_TIMEOUT).

For more details, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-timeout.
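A hedged sketch of raising NCCL's InfiniBand tolerance follows; the specific values are illustrative assumptions to tune for your cluster, not recommended settings:

```python
import os

# Increase NCCL's tolerance of transient InfiniBand errors.
# Both variables must be set before NCCL initializes, and the
# values below are illustrative only.
os.environ["NCCL_IB_TIMEOUT"] = "22"    # IB verbs timeout exponent (range 1-22)
os.environ["NCCL_IB_RETRY_CNT"] = "13"  # number of IB retransmission attempts
```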