Symptom
A node is running properly and has GPU resources, but the following error message is displayed:
0/9 nodes are available: 9 insufficient nvidia.com/gpu
Analysis
View the NVIDIA driver installation log on the node at the following path:
/opt/cloud/cce/nvidia/nvidia_installer.log
View standard output logs of the NVIDIA container.
Filter the container ID by running the following command:
docker ps -a | grep nvidia
View logs by running the following command:
docker logs <container-ID>
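The two commands above can be combined into one pass. The following is a minimal sketch, not an official tool: the function names nvidia_container_ids and show_nvidia_logs are hypothetical, and the 50-line tail is an arbitrary choice.

```shell
# Print the container ID (first column) of every `docker ps -a` line
# on stdin that mentions "nvidia". The header line is skipped because
# it does not contain that string.
nvidia_container_ids() {
    awk '/nvidia/ { print $1 }'
}

# Show the last 50 log lines of each matching container.
# Assumption: docker is installed and the current user may run it.
show_nvidia_logs() {
    docker ps -a | nvidia_container_ids | while read -r id; do
        echo "=== $id ==="
        docker logs --tail 50 "$id" 2>&1
    done
}
```

Calling show_nvidia_logs on the node prints the recent output of every NVIDIA-related container, including stopped ones.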
Run the following command to check the CUDA version in the container:
cat /usr/local/cuda/version.txt
Check whether the NVIDIA driver version on the node where the container runs supports the CUDA version used in the container.
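The comparison can be scripted. The sketch below assumes a simple rule: the container's CUDA version (MAJOR.MINOR) must not exceed the highest CUDA version the driver supports. The helper names major_minor, version_le, and check_cuda are hypothetical, and `sort -V` is assumed to be available (it is in GNU coreutils).

```shell
# Reduce text on stdin to its first MAJOR.MINOR version number.
major_minor() {
    grep -oE '[0-9]+\.[0-9]+' | head -n 1
}

# Succeed if version $1 <= version $2 (relies on version sort, `sort -V`).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: $1 = highest CUDA version the driver supports,
# $2 = CUDA version found in the container.
check_cuda() {
    if version_le "$2" "$1"; then
        echo "compatible"
    else
        echo "incompatible"
    fi
}

# Example wiring (assumptions, not executed here): recent drivers print
# "CUDA Version: X.Y" in the nvidia-smi header.
#   driver=$(nvidia-smi | grep -o 'CUDA Version: [0-9.]*' | major_minor)
#   container=$(docker exec <container-ID> cat /usr/local/cuda/version.txt | major_minor)
#   check_cuda "$driver" "$container"
```

If the result is incompatible, the driver on the node is likely too old for the CUDA version in the container image.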