import os os.environ["NCCL_DEBUG"] = "INFO"
The following error message is displayed.
The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. As a result, the communication is slow and unstable, and the IB communication is interrupted.
Add environment variables to the code.
import os os.environ["NCCL_IB_TC"] = "128" os.environ["NCCL_IB_GID_INDEX"] = "3" os.environ["NCCL_IB_TIMEOUT"] = "22"