If a multi-node training job hangs before training starts, add os.environ["NCCL_DEBUG"] = "INFO" to the code, before NCCL is initialized, so that NCCL debugging information is written to the logs.
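The setting above might be placed as in the following minimal sketch. The helper name enable_nccl_debug is hypothetical, and the comment about torch.distributed assumes a PyTorch-style entry script.

```python
import os


def enable_nccl_debug(level: str = "INFO") -> None:
    """Set NCCL_DEBUG so that NCCL writes debugging information to the logs."""
    os.environ["NCCL_DEBUG"] = level


# Call this at the very top of the entry script, before any distributed
# initialization (for example torch.distributed.init_process_group in a
# PyTorch script), because NCCL reads the variable when it starts up.
enable_nccl_debug()
```

Setting the variable inside the script rather than in a shell wrapper keeps the change visible to every process that runs the same entry point.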
The job hangs before any NCCL debugging information appears in the logs.
Check how the code sets parameters such as master_ip and rank, and ensure that every one of them is specified.
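One way to run this check automatically is sketched below. It assumes the script receives its rendezvous settings through the environment variables MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which is what PyTorch's env:// initialization reads; if your code passes master_ip and rank as arguments instead, adapt the names. The function missing_rendezvous_vars is a hypothetical helper.

```python
import os

# Assumed variable names (PyTorch env:// rendezvous). Your script may use
# different names or command-line arguments for master_ip and rank.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_rendezvous_vars()
if missing:
    # Report the gap early instead of letting the job hang at startup.
    print("Missing distributed settings:", ", ".join(missing))
```

Running the check on every node before initialization makes a missing value show up as a log line rather than as a silent hang.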
GDR (GPUDirect RDMA) information is displayed for only some nodes of a multi-node training job.
A possible cause of the hang is GDR.
Set os.environ["NCCL_NET_GDR_LEVEL"] = '0' at the beginning of the program to disable GDR, or ask O&M personnel to configure GDR on the affected nodes.
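To find which nodes lack GDR information, the per-node logs can be compared. The sketch below assumes that you have already collected each node's log text into a dictionary and that GDR-enabled transports are marked with the substring GDRDMA in the NCCL INFO output; both the marker and the helper name nodes_without_gdr are assumptions, so adjust the marker to whatever your logs actually show.

```python
# Assumed marker for GDR-enabled transports in NCCL INFO output.
GDR_MARKER = "GDRDMA"


def nodes_without_gdr(logs_by_node, marker=GDR_MARKER):
    """Return the sorted names of nodes whose log text never mentions the marker."""
    return sorted(node for node, text in logs_by_node.items() if marker not in text)


# Hypothetical log excerpts, for illustration only.
example_logs = {
    "node-0": "Channel 00 : 0 -> 1 [send] via NET/IB/0/GDRDMA",
    "node-1": "Channel 00 : 1 -> 0 [send] via NET/IB/0",
}
affected = nodes_without_gdr(example_logs)
print("Nodes without GDR information:", affected)
```

If the list is non-empty but does not cover every node, the mismatch described above is a likely suspect.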
Communication errors such as "Got completion with error 12, opcode 1, len 32478, vendor err 129" are displayed in the logs. This indicates that the network is unstable.
Add the following environment variables:
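The original list of variables is not reproduced here, so the following is only a sketch. It sets NCCL_IB_TIMEOUT, the variable that the link below documents; the value 22 is an illustrative assumption, not a recommendation from the source.

```python
import os

# Illustrative value only. NCCL_IB_TIMEOUT controls the InfiniBand verbs
# timeout; check the NCCL documentation for the valid range and the default
# of your NCCL version before choosing a value.
os.environ["NCCL_IB_TIMEOUT"] = "22"
```

As with the other NCCL settings, the variable needs to be set before NCCL is initialized to take effect.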
For more details, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-timeout.