Reviewed-by: Eotvos, Oliver <oliver.eotvos@t-systems.com> Co-authored-by: Dong, Qiu Jian <qiujiandong1@huawei.com> Co-committed-by: Dong, Qiu Jian <qiujiandong1@huawei.com>
<a name="cce_faq_00020"></a>
<h1 class="topictitle1">How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?</h1>
<div id="body1529474173323"><div class="section" id="cce_faq_00020__section629233718586"><h4 class="sectiontitle">Did a Resource Scheduling Failure Event Occur on a Cluster Node?</h4><p id="cce_faq_00020__p770412286597"><strong id="cce_faq_00020__b116551253111">Symptom</strong></p>
<p id="cce_faq_00020__p9326151192319">A node is running properly and has GPU resources. However, the following error information is displayed:</p>
<p id="cce_faq_00020__p7442182515227">0/9 nodes are available: 9 Insufficient nvidia.com/gpu</p>
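<p>To confirm what the scheduler sees, the GPU count each node advertises can be listed directly. The following is a minimal sketch, assuming <code>kubectl</code> is installed and configured for the affected cluster; <code>gpu_allocatable</code> is a hypothetical helper name, not a CCE command.</p>

```shell
# Sketch: list each node with its allocatable nvidia.com/gpu count.
# Assumes kubectl is installed and configured for the affected cluster.
# gpu_allocatable is a hypothetical helper name, not part of CCE.
gpu_allocatable() {
  # The dot in the resource name must be escaped in the column path.
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

<p>A node that shows <code>&lt;none&gt;</code> or <code>0</code> in the GPU column is not advertising GPUs to the scheduler, even if the hardware is present.</p>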
<p id="cce_faq_00020__p11299340506"><strong id="cce_faq_00020__b84235270620713">Analysis</strong></p>
<ol id="cce_faq_00020__ol437653629"><li id="cce_faq_00020__li3377163328">Check whether the node has the NVIDIA label.<p id="cce_faq_00020__p184511501213"><a name="cce_faq_00020__li3377163328"></a><a name="li3377163328"></a><span><img id="cce_faq_00020__image7845750623" src="en-us_image_0000001898023841.png"></span></p>
<p id="cce_faq_00020__p18456509215"></p>
</li><li id="cce_faq_00020__li731883032913">Check whether the NVIDIA driver is running properly.<div class="p" id="cce_faq_00020__p1272833122919"><a name="cce_faq_00020__li731883032913"></a><a name="li731883032913"></a>Log in to the node where the add-on is running and view the driver installation log at the following path:<pre class="screen" id="cce_faq_00020__screen1376913411638">/opt/cloud/cce/nvidia/nvidia_installer.log</pre>
</div>
<p id="cce_faq_00020__p266245410316">View the standard output logs of the NVIDIA container.</p>
<p id="cce_faq_00020__p13148224125115">Find the container ID by running the following command:</p>
<pre class="screen" id="cce_faq_00020__screen19310058155015">docker ps -a | grep nvidia</pre>
<p id="cce_faq_00020__p1966316542313">View logs by running the following command:</p>
<pre class="screen" id="cce_faq_00020__screen85471945195110">docker logs <em id="cce_faq_00020__i6789161810543">Container ID</em></pre>
</li></ol>
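<p>The log checks in step 2 can be combined into one helper. This is a sketch only, assuming the node uses Docker as its container runtime and that the commands are run on the GPU node itself; <code>nvidia_driver_logs</code> is a hypothetical name, and the 50-line limit is an arbitrary choice.</p>

```shell
# Sketch of the log checks above, wrapped in one function.
# nvidia_driver_logs is a hypothetical helper name; run it on the GPU node.
nvidia_driver_logs() {
  installer_log=/opt/cloud/cce/nvidia/nvidia_installer.log

  # Driver installation log (path taken from the step above).
  if [ -r "$installer_log" ]; then
    echo "== $installer_log (last 50 lines) =="
    tail -n 50 "$installer_log"
  fi

  # Standard output of every container whose listing mentions "nvidia".
  docker ps -a | grep nvidia | awk '{print $1}' | while read -r id; do
    echo "== docker logs $id (last 50 lines) =="
    docker logs --tail 50 "$id"
  done
}
```

<p>Calling <code>nvidia_driver_logs</code> prints the tail of the installer log, if it is readable, followed by the recent output of each matching container.</p>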
</div>
<div class="section" id="cce_faq_00020__section13331111992411"><h4 class="sectiontitle">What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?</h4><p id="cce_faq_00020__p682917160543">Run the following command to check the CUDA version in the container:</p>
<pre class="screen" id="cce_faq_00020__screen494951975412">cat /usr/local/cuda/version.txt</pre>
<p id="cce_faq_00020__p6814323122612">Check whether the CUDA versions supported by the NVIDIA driver on the node where the container runs include the CUDA version used in the container.</p>
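<p>The comparison can be sketched as follows. It rests on three assumptions: <code>nvidia-smi</code> on the node prints a <code>CUDA Version</code> field in its header showing the highest CUDA version the driver supports, a container is compatible when its CUDA version is not newer than that value, and GNU <code>sort -V</code> is available. The function names are hypothetical, and the check ignores CUDA compatibility packages.</p>

```shell
# Extract the first "major.minor" CUDA version from text such as
# "CUDA Version 10.1.243" or the nvidia-smi header "CUDA Version: 12.2".
cuda_major_minor() {
  sed -n 's/.*CUDA Version:* *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if the container CUDA version ($1) is not newer than the highest
# CUDA version the node driver supports ($2). Both are "major.minor".
# Relies on GNU sort -V (version sort).
cuda_supported() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$1" ]
}

# Example usage (run each command where indicated):
#   container=$(cat /usr/local/cuda/version.txt | cuda_major_minor)  # in the container
#   driver=$(nvidia-smi | cuda_major_minor)                          # on the node
#   cuda_supported "$container" "$driver" && echo "compatible"
```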
</div>
<div class="section" id="cce_faq_00020__section16392113515592"><h4 class="sectiontitle">Helpful Links</h4><p id="cce_faq_00020__p1773611445201"><a href="cce_faq_00109.html">What Should I Do If an Error Occurs When Deploying a Service on the GPU Node?</a></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="cce_faq_00281.html">Node Running</a></div>
</div>
</div>