How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?
Did a Resource Scheduling Failure Event Occur on a Cluster Node?
Symptom
A node is running properly and has GPU resources. However, the following error information is displayed:
0/9 nodes are available: 9 insufficient nvidia.com/gpu
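This event indicates that no node can satisfy the pod's nvidia.com/gpu request. As a quick first check, assuming you have kubectl access to the cluster, you can confirm whether a node actually advertises GPU resources (the node name gpu-node-1 is only a placeholder):
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
If nvidia.com/gpu does not appear under Capacity and Allocatable, continue with the analysis below.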
Analysis
- Check whether the NVIDIA label is attached to the node (see the sketch after this list).
- Check whether the NVIDIA driver is running properly. Log in to the node where the add-on is running and view the driver installation log in the following path:
/opt/cloud/cce/nvidia/nvidia_installer.log
View the standard output logs of the NVIDIA container.
Filter the container ID by running the following command:
docker ps -a | grep nvidia
View logs by running the following command:
docker logs <container_ID>
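For the label check in the list above, a minimal sketch (the node name gpu-node-1 is a placeholder; the exact label key set by the GPU add-on may differ, so inspect the full label list):
kubectl get node gpu-node-1 --show-labels
On the node itself, you can also verify that the driver is loaded and responding:
nvidia-smi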
What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?
Run the following command to check the CUDA version in the container:
cat /usr/local/cuda/version.txt
Check whether the CUDA version used in the container is included in the range of CUDA versions supported by the NVIDIA driver installed on the node where the container runs.
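One way to make this comparison, assuming nvidia-smi is available on the node: the header of its output shows the driver version together with the highest CUDA version that driver supports, which must be no lower than the CUDA version found in the container:
nvidia-smi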
Parent topic: Node Running