<a name="cce_faq_00098"></a><a name="cce_faq_00098"></a>
<h1 class="topictitle1">Failed to Schedule an Instance</h1>
<div id="body0000001197234823"><div class="section" id="cce_faq_00098__section1197903153210"><h4 class="sectiontitle">Fault Locating</h4><ul id="cce_faq_00098__ul87865753413"><li id="cce_faq_00098__li1578617173413"><a href="#cce_faq_00098__section1678344818322">Viewing K8s Event Information</a></li><li id="cce_faq_00098__li3786167143417"><a href="#cce_faq_00098__section133416392418">Check Item 1: Checking Whether a Node Is Available in the Cluster</a></li><li id="cce_faq_00098__li378619793410"><a href="#cce_faq_00098__section29231833141817">Check Item 2: Checking Whether Node Resources (CPU and Memory) Are Sufficient</a></li><li id="cce_faq_00098__li15786117113415"><a href="#cce_faq_00098__section794092214205">Check Item 3: Checking the Affinity and Anti-Affinity Configuration of the Workload</a></li><li id="cce_faq_00098__li578617715348"><a href="#cce_faq_00098__section197421559143010">Check Item 4: Checking Whether the Workload's Volume and Node Reside in the Same AZ</a></li></ul>
</div>
<div class="section" id="cce_faq_00098__section1678344818322"><a name="cce_faq_00098__section1678344818322"></a><a name="section1678344818322"></a><h4 class="sectiontitle">Viewing K8s Event Information</h4><p id="cce_faq_00098__p470011505299">If the <span class="keyword" id="cce_faq_00098__keyword1195263911363">workload</span> is in the Unready state and reports the "<span class="uicontrol" id="cce_faq_00098__uicontrol8511142313418"><b><span class="keyword" id="cce_faq_00098__keyword113875495365">InstanceSchedulingFailed</span></b></span>" event, check the workload's K8S events to identify the cause.</p>
<p id="cce_faq_00098__p821781883313">As shown in the following figure, the K8s event is "0/163 nodes are available: 133 Insufficient memory", indicating that the memory is insufficient.</p>
<p id="cce_faq_00098__p206171744195115"><strong id="cce_faq_00098__b173851755134211">Complex scheduling failure information:</strong></p>
<ul id="cce_faq_00098__ul19947165165119"><li id="cce_faq_00098__li427355317531"><strong id="cce_faq_00098__b11295147124317">no nodes available to schedule pods</strong>: No node resource is available for scheduling workload pods.</li><li id="cce_faq_00098__li12767120175517"><strong id="cce_faq_00098__b191863481436">0/163 nodes are available: 133 Insufficient memory</strong>: The node is available but the memory is insufficient.</li><li id="cce_faq_00098__li32981325165517"><strong id="cce_faq_00098__b19269113154416">163 Insufficient cpu</strong>: The CPU is insufficient.</li><li id="cce_faq_00098__li8735182915519"><strong id="cce_faq_00098__b10801101518445">49 Insufficient nvidia.com/gpu</strong>: The nvidia.com/gpu resources are insufficient.</li><li id="cce_faq_00098__li19472513510"><strong id="cce_faq_00098__b479273114446">49 InsufficientResourceOnSingleGPU</strong>: GPU resources are insufficient.</li></ul>
<p id="cce_faq_00098__p3988112527"><strong id="cce_faq_00098__b68641467444">Information interpretation:</strong></p>
<ul id="cce_faq_00098__ul81029119527"><li id="cce_faq_00098__li12102911175212"><strong id="cce_faq_00098__b14945553124413">0/163 nodes are available</strong>: There are 163 nodes in the cluster, and no node meets the scheduling rules.</li><li id="cce_faq_00098__li0102611195210"><strong id="cce_faq_00098__b661033184513">133 Insufficient memory</strong>: The memory of 133 nodes is insufficient.</li><li id="cce_faq_00098__li171021711175216"><strong id="cce_faq_00098__b1136829124519">163 Insufficient cpu</strong>: The CPUs of 163 nodes are insufficient.</li><li id="cce_faq_00098__li510217112524"><strong id="cce_faq_00098__b188985289459">49 Insufficient nvidia.com/gpu</strong>: The GPUs of 49 nodes are insufficient.</li></ul>
<p id="cce_faq_00098__p18132230143312">The following is the fault locating procedure:</p>
</div>
<div class="section" id="cce_faq_00098__section133416392418"><a name="cce_faq_00098__section133416392418"></a><a name="section133416392418"></a><h4 class="sectiontitle">Check Item 1: Checking Whether a Node Is Available in the Cluster</h4><p id="cce_faq_00098__p1865332216103">Log in to the CCE console. In the navigation pane, choose <strong id="cce_faq_00098__b176314774610">Resource Management</strong> > <strong id="cce_faq_00098__b1776497174616">Nodes</strong> to check whether the node where the workload runs is in the available state.</p>
<p id="cce_faq_00098__p1286310506457">For example, the event "0/1 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space" indicates that the pod fails to be scheduled due to no available node.</p>
<p id="cce_faq_00098__p3105124122916"><strong id="cce_faq_00098__b198653410220">Solution</strong></p>
<ul id="cce_faq_00098__ul4865142283015"><li id="cce_faq_00098__li6865222183015">Add a node and migrate the pods to the new available node to ensure that services are running properly. Then, rectify the fault on the unavailable node. For details about the troubleshooting process, see the methods in the node FAQs.</li><li id="cce_faq_00098__li1986510222305">Create a new node or repair the faulty one.</li></ul>
</div>
<div class="section" id="cce_faq_00098__section29231833141817"><a name="cce_faq_00098__section29231833141817"></a><a name="section29231833141817"></a><h4 class="sectiontitle">Check Item 2: Checking Whether Node Resources (CPU and Memory) Are Sufficient</h4><p id="cce_faq_00098__p12581252181019">If the requested workload resources exceed the available resources of the node where the workload runs, the node cannot provide the resources required to run new pods and pod scheduling onto the node will definitely fail.</p>
<ol id="cce_faq_00098__ol463717307123"><li id="cce_faq_00098__li98681844192116"><span>On the CCE console, choose <span class="uicontrol" id="cce_faq_00098__uicontrol4854318125311"><b>Workloads</b></span> > <strong id="cce_faq_00098__b220885525315">Deployments</strong> or <strong id="cce_faq_00098__b158696575530">StatefulSets</strong> in the navigation pane, click the workload name, and click <strong id="cce_faq_00098__b285571819537">Pods</strong> and then <strong id="cce_faq_00098__b1855111818535">Events</strong> tabs to view pod events.</span><p><p id="cce_faq_00098__p41718615466">The event "0/1 nodes are available: 1 Insufficient cpu" indicates that the pod fails to be scheduled due to insufficient node resources.</p>
</p></li><li id="cce_faq_00098__li963773011129"><span>In the navigation pane, choose <strong id="cce_faq_00098__b8955113811445">Resource Management</strong> > <strong id="cce_faq_00098__b970534204413">Nodes</strong> to view available CPUs and memory of the node where the workload runs.</span><p><p id="cce_faq_00098__p7821161713460">In this example, 0.88 vCPUs and 0.8 GiB memory are available for the node.</p>
</p></li><li id="cce_faq_00098__li1015616526121"><span>In the navigation pane, choose <strong id="cce_faq_00098__b14998357175117">Workloads</strong> and click the workload name to view the workload's CPU request and memory request.</span><p><p id="cce_faq_00098__p3454227174613">In this example, the CPU request is 2 vCPUs and the memory request is 0.5 GiB. The CPU request exceeds the available CPU resources, which causes pod scheduling to fail.</p>
</p></li></ol>
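<p>The comparison in the preceding steps can also be made with kubectl (a sketch; the node, Deployment, and namespace names are placeholders). A pod can be scheduled onto a node only if its requests fit into the node's allocatable resources minus the requests of the pods already running there:</p>
<pre class="screen"># The Allocatable section lists the CPU and memory the node can offer to pods;
# the "Allocated resources" section shows how much has already been requested.
kubectl describe node &lt;node-name&gt;

# View the CPU and memory requests declared by the workload.
kubectl get deployment &lt;deployment-name&gt; -n &lt;namespace&gt; -o jsonpath='{.spec.template.spec.containers[*].resources.requests}'</pre>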
<p id="cce_faq_00098__p18201251142310"><strong id="cce_faq_00098__b9928102181517">Solution</strong></p>
<p id="cce_faq_00098__p1470164852310">On the ECS console, modify node specifications to expand node resources.</p>
</div>
<div class="section" id="cce_faq_00098__section794092214205"><a name="cce_faq_00098__section794092214205"></a><a name="section794092214205"></a><h4 class="sectiontitle">Check Item 3: Checking the Affinity and Anti-Affinity Configuration of the Workload</h4><p id="cce_faq_00098__p3690103662116">Inappropriate affinity policies will cause pod scheduling to fail.</p>
<pre class="screen" id="cce_faq_00098__screen99174215495">0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affintity, 1 node(s) didn't match pod anti-affinity rules.</pre>
<p id="cce_faq_00098__p5845375279"><strong id="cce_faq_00098__b1896020315359">Solution</strong></p>
<ul id="cce_faq_00098__ul113635413218"><li id="cce_faq_00098__li7319343142111">When adding <span class="uicontrol" id="cce_faq_00098__uicontrol544584293017"><b>workload-workload affinity</b></span> and <span class="uicontrol" id="cce_faq_00098__uicontrol4445042193011"><b>workload-node affinity</b></span> policies, ensure that the two types of policies do not conflict each other. Otherwise, workload deployment will fail. For example, workload deployment will fail if the following conditions are met:<p id="cce_faq_00098__p47431261224">Assumptions: An anti-affinity relationship is established between workload 1 and workload 2. Workload 1 is deployed on node 1 while workload 2 is deployed on node 2.</p>
<p id="cce_faq_00098__p77501221623">When you try to deploy workload 3 on node 3 and establish an affinity relationship with workload 2, a conflict occurs, resulting in a workload deployment failure.</p>
</li><li id="cce_faq_00098__li19819172752717">If the workload has a node affinity policy, make sure that <strong id="cce_faq_00098__b111241146151112">supportContainer</strong> in the label of the affinity node is set to <strong id="cce_faq_00098__b212511461114">true</strong>. Otherwise, pods cannot be scheduled onto the affinity node and the following event is generated:<pre class="screen" id="cce_faq_00098__screen4206153932918">No nodes are available that match all of the following predicates: MatchNode Selector, NodeNotSupportsContainer</pre>
<p id="cce_faq_00098__p4903951132618">If <strong id="cce_faq_00098__b7714749141618">supportContainer</strong> is set to <strong id="cce_faq_00098__b1815885191610">false</strong>, the scheduling fails. The following figure shows the error information.</p>
<pre class="screen" id="cce_faq_00098__screen105811150135317">0/1 nodes are available: 1</pre>
</li></ul>
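<p>The following is a minimal sketch of how such policies are expressed in a pod template (the <strong>spec.template.spec</strong> section of a Deployment). The label keys and values are illustrative placeholders, not values taken from this example: a pod carrying this fragment must land on a node labeled <strong>node-pool=pool-1</strong> and must not share a node with pods labeled <strong>app=workload-1</strong>. If no node satisfies both rules at once, the "didn't match pod affinity/anti-affinity" event shown above is produced.</p>
<pre class="screen">affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-pool                 # placeholder node label
          operator: In
          values:
          - pool-1
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: workload-1                # placeholder label of the workload to keep away from
      topologyKey: kubernetes.io/hostname</pre>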
</div>
<div class="section" id="cce_faq_00098__section197421559143010"><a name="cce_faq_00098__section197421559143010"></a><a name="section197421559143010"></a><h4 class="sectiontitle">Check Item 4: Checking Whether the Workload's Volume and Node Reside in the Same AZ</h4><p class="MsoNormal" id="cce_faq_00098__p55223077">Pod scheduling fails if the workload's volume and node reside in different AZs.</p>
<pre class="screen" id="cce_faq_00098__screen183575176553">0/1 nodes are available: 1 NoVolumeZoneConflict.</pre>
<p id="cce_faq_00098__p128612406323"><strong id="cce_faq_00098__b144771634193016">Solution</strong></p>
<p id="cce_faq_00098__p145982036103213">In the AZ where the workload's node resides, create a new volume. Alternatively, create an identical workload and select an automatically assigned cloud storage volume.</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="cce_faq_00029.html">Workload Abnormalities</a></div>
</div>
</div>