Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lai, Weijian <laiweijian4@huawei.com> Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
Error Message "no socket interface found" Displayed in Logs
Symptom
An NCCL debug log level is set in a distributed job executed using a PyTorch image.
import os
os.environ["NCCL_DEBUG"] = "INFO"
The following error message is displayed in the logs: "no socket interface found".
Possible Causes
The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. As a result, communication is slow and unstable, and InfiniBand (IB) communication is interrupted.
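To confirm this cause before changing anything, the job can report which of the three variables are unset. The following is a minimal sketch, not part of the original troubleshooting steps; the helper name unset_nccl_ib_vars is hypothetical, and an empty value is assumed to count as not configured.

```python
import os

# The three variables named in Possible Causes.
REQUIRED_NCCL_IB_VARS = ("NCCL_IB_TC", "NCCL_IB_GID_INDEX", "NCCL_IB_TIMEOUT")


def unset_nccl_ib_vars(env=os.environ):
    """Return the names of the required variables that are missing or empty.

    Hypothetical helper for illustration; `env` can be any mapping,
    which makes the check easy to try without touching the real environment.
    """
    return [name for name in REQUIRED_NCCL_IB_VARS if not env.get(name)]


missing = unset_nccl_ib_vars()
if missing:
    print("Not configured:", ", ".join(missing))
else:
    print("All three NCCL IB variables are configured.")
```

If the check lists any names, the Solution section applies.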
Solution
Add the following environment variables to the code.
import os
os.environ["NCCL_IB_TC"] = "128"
os.environ["NCCL_IB_GID_INDEX"] = "3"
os.environ["NCCL_IB_TIMEOUT"] = "22"
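The same settings can be grouped in one place so that they are applied together at the start of the training script. This is a sketch under assumptions: the helper name apply_nccl_ib_settings is hypothetical, and the values are the ones given above. NCCL typically reads its environment variables when communication is initialized, so the call most likely belongs before the distributed setup in the script.

```python
import os

# Values taken from the Solution above.
NCCL_IB_SETTINGS = {
    "NCCL_IB_TC": "128",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_TIMEOUT": "22",
}


def apply_nccl_ib_settings(env=os.environ):
    """Write the three settings into `env` and return the previous values.

    Hypothetical helper for illustration. A previous value of None means
    the variable was not set before the call.
    """
    previous = {name: env.get(name) for name in NCCL_IB_SETTINGS}
    env.update(NCCL_IB_SETTINGS)
    return previous


# Apply the settings to the current process and show what they replaced.
previous = apply_nccl_ib_settings()
print("Previous values:", previous)
```

Returning the previous values makes it visible in the job log whether the platform had already configured any of the variables before they were overwritten.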