doc-exports/docs/mrs/umn/ALM-18024.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-18024"></a>
<h1 class="topictitle1">ALM-18024 Pending Yarn Memory Usage Exceeds the Threshold</h1>
<div id="body1594113689592"><div class="section" id="ALM-18024__section31658481"><h4 class="sectiontitle">Description</h4><p id="ALM-18024__p3566950121720">The alarm module checks the pending memory of Yarn every 60 seconds. The alarm is generated when the pending memory exceeds the threshold. Pending memory is the total amount of memory that submitted Yarn applications have requested but have not yet been allocated.</p>
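<p>The check described above can be illustrated with a minimal sketch. It is not the alarm module's actual implementation: the function name, the per-queue dictionary, and the use of MB as the unit are assumptions made for illustration only.</p>

```python
# The description states that the check runs every 60 seconds.
CHECK_INTERVAL_SECONDS = 60


def queues_over_threshold(pending_mb_by_queue, threshold_mb):
    """Return the sorted names of queues whose pending memory exceeds the threshold.

    pending_mb_by_queue is a hypothetical mapping of queue name to pending memory.
    Assumption: values are in MB; the source does not state the unit.
    """
    return sorted(
        name for name, pending_mb in pending_mb_by_queue.items()
        if pending_mb > threshold_mb
    )


# Hypothetical readings for two queues.
readings = {"default": 4096, "etl": 512}
print(queues_over_threshold(readings, threshold_mb=2048))
```

<p>A value equal to the threshold does not trigger the alarm in this sketch, because the description says the alarm is generated when the pending memory exceeds the threshold.</p>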
</div>
<div class="section" id="ALM-18024__section16490876"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18024__table7825795184" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18024__row10829199161819"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-18024__p7830149181817">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-18024__p4832169171818">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-18024__p7834295185">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18024__row11834698184"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-18024__p138359915188">18024</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-18024__p108361599186">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-18024__p1083810991819">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18024__section14200159"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18024__table15448152818187" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18024__row2451192861813"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-18024__p445318287184">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-18024__p14455152871817">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18024__row077613291817"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18024__p17935380415">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18024__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18024__row8457102815185"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18024__p17459122801816">QueueName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18024__p8460192821819">Identifies the queue for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18024__row846182891817"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18024__p1546213282187">QueueMetric</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18024__p8462142814188">Identifies the queue indicator for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18024__section60692571"><h4 class="sectiontitle">Impact on the System</h4><ul id="ALM-18024__ul8914113131715"><li id="ALM-18024__li591413314174">It takes a long time for an application to finish.</li><li id="ALM-18024__li1242610612171">A new application cannot start running after it is submitted.</li></ul>
</div>
<div class="section" id="ALM-18024__section9362234"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-18024__ul29000801"><li id="ALM-18024__li2292055">NodeManager node resources are insufficient.</li><li id="ALM-18024__li13582173317229">The maximum resource capacity of the queue and the maximum AM resource percentage are set too low.</li><li id="ALM-18024__li945917112045">The monitoring threshold is set too low.</li></ul>
</div>
<div class="section" id="ALM-18024__section18537579256"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-18024__p11781195754620"><strong id="ALM-18024__b167811057124615">Check NodeManager resources.</strong></p>
<ol id="ALM-18024__ol11781057184618"><li id="ALM-18024__li9781125714466"><span>On FusionInsight Manager, choose <strong id="ALM-18024__b17929172141611">Cluster</strong> &gt; <em id="ALM-18024__i1930192114169">Name of the desired cluster</em> &gt; <strong id="ALM-18024__b11931221161616">Services</strong> &gt; <strong id="ALM-18024__b4932721161619">Yarn</strong> &gt; <strong id="ALM-18024__b12934721161610">ResourceManager (Active)</strong> to access the ResourceManager web UI.</span></li><li id="ALM-18024__li46851012470"><span>Click <strong id="ALM-18024__b24341125201415">Scheduler</strong> and check whether the root queue resources are used up in <strong id="ALM-18024__b12435925101420">Application Queues</strong>.</span><p><ul id="ALM-18024__ul187511556173018"><li id="ALM-18024__li675185663018">If yes, go to <a href="#ALM-18024__li1894618168247">3</a>.</li><li id="ALM-18024__li114533583111">If no, go to <a href="#ALM-18024__li156321342274">4</a>.</li></ul>
</p></li><li id="ALM-18024__li1894618168247"><a name="ALM-18024__li1894618168247"></a><a name="li1894618168247"></a><span>Expand the capacity of the NodeManager instance of the Yarn service. After the capacity expansion, check whether the alarm is cleared.</span><p><ul id="ALM-18024__ul2024294142412"><li id="ALM-18024__li172421049244">If yes, no further action is required.</li><li id="ALM-18024__li1424317422412">If no, go to <a href="#ALM-18024__li15314143611285">6</a>.</li></ul>
</p></li></ol>
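<p>The branch taken in steps 2 and 3 above can be sketched as follows. This is an illustrative helper only: the function names are hypothetical, and treating 100% usage as "used up" is an assumption, since the procedure leaves that judgment to the operator reading the <strong>Application Queues</strong> view.</p>

```python
def root_queue_used_up(used_percent, full_at_percent=100.0):
    """Treat the root queue as used up once its usage reaches full_at_percent.

    Assumption: usage is read manually from the Scheduler page as a percentage.
    """
    return used_percent >= full_at_percent


def next_step_after_scheduler_check(used_percent):
    """Return the step number to continue with after step 2.

    Step 3 expands NodeManager capacity; step 4 checks the individual queue.
    """
    return 3 if root_queue_used_up(used_percent) else 4


print(next_step_after_scheduler_check(100.0))
print(next_step_after_scheduler_check(72.5))
```

<p>Lowering <code>full_at_percent</code> would treat a nearly full root queue as used up. That is a design choice for the sketch, not something the procedure prescribes.</p>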
<p id="ALM-18024__p8592842272"><strong id="ALM-18024__b18520356451556">Check the maximum queue resource capacity and the maximum AM resource percentage.</strong></p>
<ol start="4" id="ALM-18024__ol15633154152713"><li id="ALM-18024__li156321342274"><a name="ALM-18024__li156321342274"></a><a name="li156321342274"></a><span>Check whether the resources of the queue corresponding to the pending task are used up.</span><p><ul id="ALM-18024__ul116320432715"><li id="ALM-18024__li19632114122710">If yes, go to <a href="#ALM-18024__li1663218419278">5</a>.</li><li id="ALM-18024__li1663212411273">If no, go to <a href="#ALM-18024__li15314143611285">6</a>.</li></ul>
</p></li><li id="ALM-18024__li1663218419278"><a name="ALM-18024__li1663218419278"></a><a name="li1663218419278"></a><span>On FusionInsight Manager, choose <strong id="ALM-18024__b27144265163">Tenant Resources</strong> &gt; <strong id="ALM-18024__b12716426141614">Dynamic Resource Plan</strong> and add resources as required. Check whether the alarm is cleared.</span><p><ul id="ALM-18024__ul106325419271"><li id="ALM-18024__li0632204152712">If yes, no further action is required.</li><li id="ALM-18024__li66321941273">If no, go to <a href="#ALM-18024__li15314143611285">6</a>.</li></ul>
</p></li></ol>
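<p>The queue check in step 4 above can be sketched as a lookup in a queue tree. The nested dictionary layout and the field names (<code>name</code>, <code>used_mb</code>, <code>max_mb</code>, <code>children</code>) are hypothetical and do not represent the format returned by Yarn or FusionInsight Manager.</p>

```python
def find_queue(tree, name):
    """Depth-first search for a queue by name in a nested dict (hypothetical layout)."""
    if tree["name"] == name:
        return tree
    for child in tree.get("children", []):
        found = find_queue(child, name)
        if found is not None:
            return found
    return None


def next_step_for_queue(tree, queue_name):
    """Return 5 if the queue's resources are used up, otherwise 6 (see step 4)."""
    queue = find_queue(tree, queue_name)
    if queue is None:
        raise KeyError(queue_name)
    used_up = queue["used_mb"] >= queue["max_mb"]
    return 5 if used_up else 6


# Hypothetical queue tree with two leaf queues.
queues = {
    "name": "root", "used_mb": 1124, "max_mb": 8192,
    "children": [
        {"name": "default", "used_mb": 1024, "max_mb": 1024},
        {"name": "etl", "used_mb": 100, "max_mb": 2048},
    ],
}
print(next_step_for_queue(queues, "default"))
print(next_step_for_queue(queues, "etl"))
```

<p>In this sketch, the <code>QueueName</code> alarm parameter would supply the name of the queue to look up.</p>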
<p id="ALM-18024__p1529393618286"><strong id="ALM-18024__b117911216204014">Adjust the monitoring thresholds.</strong></p>
<ol start="6" id="ALM-18024__ol10314143615285"><li id="ALM-18024__li15314143611285"><a name="ALM-18024__li15314143611285"></a><a name="li15314143611285"></a><span>On FusionInsight Manager, choose <strong id="ALM-18024__b179142064751556">O&amp;M</strong> &gt; <strong id="ALM-18024__b53308374851556">Alarm</strong> &gt; <strong id="ALM-18024__b140608667451556">Thresholds</strong> &gt; <em id="ALM-18024__i113997469351556">Name of the desired cluster</em> &gt; <strong id="ALM-18024__b50639140451556">Yarn</strong> &gt; <strong id="ALM-18024__b82140258751556">CPU and Memory</strong> &gt; <strong id="ALM-18024__b45092772151556">Pending Memory</strong>, and increase the threshold as required.</span></li><li id="ALM-18024__li163141936132814"><span>Check whether the alarm is cleared 5 minutes later.</span><p><ul id="ALM-18024__ul1314036132819"><li id="ALM-18024__li53146360282">If yes, no further action is required.</li><li id="ALM-18024__li731463652817">If no, go to <a href="#ALM-18024__li76841314475">8</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-18024__p5357154554619"><strong id="ALM-18024__b149211143942">Collect the fault information.</strong></p>
<ol start="8" id="ALM-18024__ol176841310478"><li id="ALM-18024__li76841314475"><a name="ALM-18024__li76841314475"></a><a name="li76841314475"></a><span>On FusionInsight Manager, choose <strong id="ALM-18024__b1851145213161">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-18024__b146315291613">Log</strong> &gt; <strong id="ALM-18024__b16513529165">Download</strong>.</span></li><li id="ALM-18024__li45621121134714"><span>Expand the <strong id="ALM-18024__b108734194410">Service</strong> drop-down list, and select <strong id="ALM-18024__b17881542440">Yarn</strong> for the target cluster.</span></li><li id="ALM-18024__li195647218474"><span>Click <span><img id="ALM-18024__image104601319175315" src="en-us_image_0263895617.png"></span> in the upper right corner, and set <strong id="ALM-18024__b1353914704417">Start Date</strong> and <strong id="ALM-18024__b16539978442">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-18024__b85400754418">Download</strong>.</span></li><li id="ALM-18024__li556542113476"><span>Contact <span id="ALM-18024__text12498181504310">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
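<p>The log collection window from the steps above is simple date arithmetic. The following sketch shows one way to compute it. The function name and the example alarm time are hypothetical.</p>

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering margin_minutes before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_collection_window(datetime(2024, 1, 1, 12, 0, 0))
print(start.isoformat(), end.isoformat())
```

<p>The resulting values correspond to the <strong>Start Date</strong> and <strong>End Date</strong> fields of the log download page.</p>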
</div>
<div class="section" id="ALM-18024__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-18024__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-18024__section20143465"><h4 class="sectiontitle">Related Information</h4><p id="ALM-18024__p32409199">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>