How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?
Did a Resource Scheduling Failure Event Occur on a Cluster Node?
Symptom
A node is running properly and has GPU resources. However, the following error information is displayed:
0/9 nodes are available: 9 insufficient nvidia.com/gpu
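This event indicates that no node can satisfy the pod's nvidia.com/gpu request. As a quick first check, assuming you have kubectl access to the cluster, you can confirm whether a node actually advertises GPU resources (the node name gpu-node-1 is only a placeholder):
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
If nvidia.com/gpu does not appear under Capacity and Allocatable, continue with the analysis below.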
Analysis
- Check whether the NVIDIA label is attached to the node (see the sketch after this list).
- Check whether the NVIDIA driver is running properly. Log in to the node where the add-on is running and view the driver installation log in the following path:
/opt/cloud/cce/nvidia/nvidia_installer.log
View the standard output logs of the NVIDIA container.
Filter the container ID by running the following command:
docker ps -a | grep nvidia
View logs by running the following command:
docker logs <container_ID>
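For the label check in the list above, a minimal sketch (the node name gpu-node-1 is a placeholder; the exact label key set by the GPU add-on may differ, so inspect the full label list):
kubectl get node gpu-node-1 --show-labels
On the node itself, you can also verify that the driver is loaded and responding:
nvidia-smi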
What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?
Run the following command to check the CUDA version in the container:
cat /usr/local/cuda/version.txt
Check whether the CUDA version used in the container is included in the range of CUDA versions supported by the NVIDIA driver installed on the node where the container runs.
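One way to make this comparison, assuming nvidia-smi is available on the node: the header of its output shows the driver version together with the highest CUDA version that driver supports, which must be no lower than the CUDA version found in the container:
nvidia-smi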
Parent topic: Node Running