doc-exports/docs/cce/umn/cce_faq_00020.html
Dong, Qiu Jian 86fb05065f CCE UMN for 24.2.0 version -20240428
Reviewed-by: Eotvos, Oliver <oliver.eotvos@t-systems.com>
Co-authored-by: Dong, Qiu Jian <qiujiandong1@huawei.com>
Co-committed-by: Dong, Qiu Jian <qiujiandong1@huawei.com>
2024-06-10 08:19:07 +00:00

<a name="cce_faq_00020"></a><a name="cce_faq_00020"></a>
<h1 class="topictitle1">How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?</h1>
<div id="body1529474173323"><div class="section" id="cce_faq_00020__section629233718586"><h4 class="sectiontitle">Did a Resource Scheduling Failure Event Occur on a Cluster Node?</h4><p id="cce_faq_00020__p770412286597"><strong id="cce_faq_00020__b116551253111">Symptom</strong></p>
<p id="cce_faq_00020__p9326151192319">A node is running properly and has GPU resources. However, the following error information is displayed:</p>
<p id="cce_faq_00020__p7442182515227">0/9 nodes are available: 9 Insufficient nvidia.com/gpu.</p>
<p id="cce_faq_00020__p11299340506"><strong id="cce_faq_00020__b84235270620713">Analysis</strong></p>
<ol id="cce_faq_00020__ol437653629"><li id="cce_faq_00020__li3377163328">Check whether the node has the NVIDIA label.<p id="cce_faq_00020__p184511501213"><a name="cce_faq_00020__li3377163328"></a><a name="li3377163328"></a><span><img id="cce_faq_00020__image7845750623" src="en-us_image_0000001898023841.png"></span></p>
<p id="cce_faq_00020__p18456509215"></p>
</li><li id="cce_faq_00020__li731883032913">Check whether the NVIDIA driver is running properly.<div class="p" id="cce_faq_00020__p1272833122919"><a name="cce_faq_00020__li731883032913"></a><a name="li731883032913"></a>Log in to the node where the add-on is running and view the driver installation log at the following path:<pre class="screen" id="cce_faq_00020__screen1376913411638">/opt/cloud/cce/nvidia/nvidia_installer.log</pre>
</div>
<p id="cce_faq_00020__p266245410316">Then view the standard output logs of the NVIDIA container.</p>
<p id="cce_faq_00020__p13148224125115">Find the container ID by running the following command:</p>
<pre class="screen" id="cce_faq_00020__screen19310058155015">docker ps -a | grep nvidia</pre>
<p id="cce_faq_00020__p1966316542313">View logs by running the following command:</p>
<pre class="screen" id="cce_faq_00020__screen85471945195110">docker logs <em id="cce_faq_00020__i6789161810543">Container ID</em></pre>
</li></ol>
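<p>The node-side checks above can be complemented from any machine with <strong>kubectl</strong> access to the cluster. The following is a minimal sketch, not a CCE-specific procedure: it assumes that the GPU device plugin advertises the standard <strong>nvidia.com/gpu</strong> resource named in the error message, and <em>node-name</em> is a placeholder for the node being checked. A node whose GPU column is empty or 0 is not advertising any GPUs to the scheduler.</p>
<pre class="screen"># List every node with the number of allocatable GPUs it reports
# (assumes the resource name nvidia.com/gpu).
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Show the labels on one node to confirm that the NVIDIA label is present.
kubectl get node <em>node-name</em> --show-labels</pre>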
</div>
<div class="section" id="cce_faq_00020__section13331111992411"><h4 class="sectiontitle">What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?</h4><p id="cce_faq_00020__p682917160543">Run the following command to check the CUDA version in the container:</p>
<pre class="screen" id="cce_faq_00020__screen494951975412">cat /usr/local/cuda/version.txt</pre>
<p id="cce_faq_00020__p6814323122612">Check whether the CUDA versions supported by the NVIDIA driver on the node where the container runs include the CUDA version in the container.</p>
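<p>As a minimal sketch of this comparison, assuming that <strong>nvidia-smi</strong> is on the PATH of the GPU node (its installation location varies by environment), the header of its output reports the driver version and the highest CUDA version that the driver supports:</p>
<pre class="screen"># Run on the GPU node: print the header line that contains
# "Driver Version" and "CUDA Version".
nvidia-smi | grep "CUDA Version"</pre>
<p>If the CUDA version in the container is later than the version reported here, the driver on the node is likely too old for that container image.</p>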
</div>
<div class="section" id="cce_faq_00020__section16392113515592"><h4 class="sectiontitle">Helpful Links</h4><p id="cce_faq_00020__p1773611445201"><a href="cce_faq_00109.html">What Should I Do If an Error Occurs When Deploying a Service on the GPU Node?</a></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="cce_faq_00281.html">Node Running</a></div>
</div>
</div>