Error Message "cuda runtime error (10) : invalid device ordinal at xxx" Displayed in Logs

Symptom

A training job failed, and the following error is displayed in logs.

Figure 1 Error log

Possible Causes

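The possible causes are as follows:

  1. CUDA operations are performed on GPUs whose IDs are not specified by CUDA_VISIBLE_DEVICES.
  2. A GPU on the resource node is damaged.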

Solution

  1. Perform CUDA operations on the GPUs with IDs specified by CUDA_VISIBLE_DEVICES.
  2. If a GPU on a resource node is damaged, contact technical support.
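The first step can be checked in the training code before any CUDA call is made. The following is a minimal sketch, not part of ModelArts or any CUDA library: the helper names visible_device_count and is_valid_ordinal are hypothetical, and the parsing is simplified (it only counts comma-separated entries). It relies on the fact that the CUDA runtime renumbers the visible GPUs from 0 inside the process, so with CUDA_VISIBLE_DEVICES=2,3 only ordinals 0 and 1 are valid.

```python
import os


def visible_device_count(env=None):
    """Count the entries in CUDA_VISIBLE_DEVICES (hypothetical helper).

    Returns None when the variable is unset, in which case every GPU on
    the node is visible and the count cannot be derived from the
    environment alone.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Simplification: count non-empty comma-separated entries only.
    return len([item for item in value.split(",") if item.strip()])


def is_valid_ordinal(ordinal, env=None):
    """Return True if `ordinal` can be a valid device index (hypothetical helper).

    Visible devices are renumbered from 0 inside the process, so valid
    ordinals are 0 .. count - 1.
    """
    count = visible_device_count(env)
    if count is None:
        # Variable unset: cannot decide from the environment alone.
        return True
    return 0 <= ordinal < count


if __name__ == "__main__":
    requested = 0  # ordinal the training code intends to use
    if not is_valid_ordinal(requested):
        raise SystemExit(
            f"GPU ordinal {requested} is outside CUDA_VISIBLE_DEVICES="
            f"{os.environ.get('CUDA_VISIBLE_DEVICES')!r}"
        )
    print(f"GPU ordinal {requested} is within the visible range")
```

A check like this fails early with a readable message instead of the "invalid device ordinal" runtime error. If the training framework exposes its own device count, that value is a more reliable upper bound than this environment-based estimate.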

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug the training code so that as many code migration errors as possible are eliminated in advance.