doc-exports/docs/modelarts/umn/modelarts_13_0028.html
Lai, Weijian 4e4b2d5f6d ModelArts UMN 23.3.0 Version.
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
2024-06-26 07:03:02 +00:00

ECC Error Occurs in the Log, Causing Training Job Failure

Symptom

The following error is reported in the log while the training job is running: RuntimeError: CUDA error: uncorrectable ECC error encountered
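To confirm that a failed job hit this error, the log can be searched for the message text. The following is a minimal sketch, assuming the log is available as plain text; the function name find_ecc_errors and the sample log lines are made up for illustration and are not part of ModelArts.

```python
# Message fragment taken from the error shown in the Symptom section.
ECC_MARKER = "uncorrectable ECC error encountered"


def find_ecc_errors(log_text):
    """Return (line_number, line) pairs for log lines that contain the ECC error."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ECC_MARKER in line:
            hits.append((number, line.strip()))
    return hits


# Illustrative log content only; a real training log will look different.
sample_log = "\n".join([
    "epoch 3 step 120 loss 0.412",
    "RuntimeError: CUDA error: uncorrectable ECC error encountered",
])

for number, line in find_ecc_errors(sample_log):
    print(f"line {number}: {line}")
```

An empty result suggests that the job failed for a different reason and that this section may not apply.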

Possible Cause

An uncorrectable ECC error indicates a GPU memory fault on the node where the job is running. If a job fails due to an ECC error, the node is automatically isolated. In this case, you need to restart the job.

Solution

If this error occurs, create the training job again. The node that reported the error is isolated automatically.
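If jobs are created from a script, the resubmission can be automated. The following is a rough sketch, not a ModelArts API: submit_job is a hypothetical callable that you provide, assumed to create a new training job, wait for it to finish, and return a (succeeded, log_text) pair. The limit of three attempts is also an assumption.

```python
# Message fragment taken from the error shown in the Symptom section.
ECC_MARKER = "uncorrectable ECC error encountered"


def run_with_ecc_retry(submit_job, max_attempts=3):
    """Create the job again after an ECC failure, up to max_attempts times.

    submit_job is a hypothetical caller-supplied function that creates a new
    training job, waits for it, and returns (succeeded, log_text).
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        succeeded, log_text = submit_job()
        if succeeded:
            return attempt
        if ECC_MARKER not in log_text:
            # Only ECC failures are retried; other failures need investigation.
            raise RuntimeError("job failed for a reason other than an ECC error")
    raise RuntimeError(
        f"job still failing with ECC errors after {max_attempts} attempts"
    )
```

Retrying only when the ECC message is present keeps unrelated failures, such as code or data errors, from being resubmitted repeatedly.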