When PyTorch is used for distributed training, a connection timeout error occurs during initialization of the process group.
If data is copied before the process group is initialized, the copy does not complete at the same time on all nodes. If torch.distributed.init_process_group() is called while the data copy is still in progress on some nodes, the connection times out.
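For illustration only, the problematic pattern looks roughly like the following sketch; src and dst stand for the source and destination paths of the data copy and are placeholders here.

    import moxing as mox
    import torch

    # Each process copies the data before joining the process group; the copy
    # duration differs from node to node.
    mox.file.copy_parallel(src, dst)

    # Processes on nodes that finish the copy early wait here for the slower
    # nodes and eventually fail with a connection timeout.
    torch.distributed.init_process_group()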
To avoid this issue, initialize the process group first, copy the data in only one process on each node, and use a barrier so that the remaining processes wait until the copy is complete. In the following code, local_rank is assumed to hold the local rank of the current process, for example parsed from the LOCAL_RANK environment variable or the script arguments.

    import moxing as mox
    import torch

    torch.distributed.init_process_group()

    # Copy the data only in the process whose local rank is 0 on each node.
    if local_rank == 0:
        mox.file.copy_parallel(src, dst)

    # Make all processes wait until the copy started by local_rank 0 is complete.
    torch.distributed.barrier()
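With this order of operations, the process group is fully set up before any data transfer starts, and torch.distributed.barrier() turns the node-to-node variation in copy time into an explicit synchronization point instead of an initialization timeout.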
Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.