Lai, Weijian 4e4b2d5f6d ModelArts UMN 23.3.0 Version.
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
2024-06-26 07:03:02 +00:00

Error Message "no socket interface found" Displayed in Logs

Symptom

In a distributed job that runs on a PyTorch image, the NCCL debug log level is set to INFO in the code:
import os
os.environ["NCCL_DEBUG"] = "INFO"

The error message "no socket interface found" is then displayed in the logs.

Figure 1 Error log

Possible Causes

The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. As a result, communication between nodes is slow and unstable, and InfiniBand (IB) communication is interrupted.
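
To confirm this cause, it can help to check which of the three variables are unset in the training process. The following is a minimal sketch only; the helper name missing_nccl_ib_vars is hypothetical and is not part of ModelArts, PyTorch, or NCCL.

```python
import os

# The three variables named in this section.
REQUIRED_NCCL_IB_VARS = ("NCCL_IB_TC", "NCCL_IB_GID_INDEX", "NCCL_IB_TIMEOUT")


def missing_nccl_ib_vars(env=None):
    """Return the names of the required variables that are unset or empty.

    `env` defaults to os.environ; any mapping can be passed instead,
    which makes the check easy to test (hypothetical helper).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_NCCL_IB_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_nccl_ib_vars()
    if missing:
        print("Not configured:", ", ".join(missing))
    else:
        print("All NCCL IB variables are configured.")
```

Running the check at the top of the training script, before any distributed initialization, shows what the process actually sees, which may differ from what was configured on the job page.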

Solution

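Add the following environment variables to the training code.

import os
# Traffic class used for InfiniBand traffic.
os.environ["NCCL_IB_TC"] = "128"
# Global ID (GID) index, used in RoCE mode.
os.environ["NCCL_IB_GID_INDEX"] = "3"
# InfiniBand Verbs timeout exponent (timeout = 4.096 us * 2^value).
os.environ["NCCL_IB_TIMEOUT"] = "22"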
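
NCCL reads its environment variables when it is initialized, so the assignments need to run before the PyTorch process group is created. The sketch below is one possible way to arrange this, not a prescribed implementation. It wraps the assignments in a hypothetical helper, apply_nccl_ib_settings, and uses setdefault so that values already injected by the job configuration are kept, which is a design choice that differs from the unconditional assignments above. The commented-out init_process_group call assumes the script uses torch.distributed with the NCCL backend.

```python
import os

# Values given in this section.
NCCL_IB_SETTINGS = {
    "NCCL_IB_TC": "128",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_TIMEOUT": "22",
}


def apply_nccl_ib_settings(env=None):
    """Set each variable unless it already has a value.

    Returns the effective values. `env` defaults to os.environ; any
    mutable mapping can be passed for testing (hypothetical helper).
    """
    env = os.environ if env is None else env
    for name, value in NCCL_IB_SETTINGS.items():
        env.setdefault(name, value)
    return {name: env[name] for name in NCCL_IB_SETTINGS}


if __name__ == "__main__":
    print(apply_nccl_ib_settings())
    # Assumption: the training script initializes the process group next, e.g.
    # import torch.distributed as dist
    # dist.init_process_group(backend="nccl")
```

If the job should always use the values from this section regardless of existing configuration, replace setdefault with a direct assignment.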