doc-exports/docs/mrs/umn/ALM-14016.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-14016"></a>
<h1 class="topictitle1">ALM-14016 DataNode Direct Memory Usage Exceeds the Threshold</h1>
<div id="body23979669"><div class="section" id="ALM-14016__section7978296"><h4 class="sectiontitle">Description</h4><p id="ALM-14016__p63305055">The system checks the direct memory usage of HDFS every 30 seconds. This alarm is generated when the direct memory usage of a DataNode instance exceeds the threshold (90% of the maximum direct memory by default).</p>
<p id="ALM-14016__p32874586">This alarm is automatically cleared when the direct memory usage falls below the threshold.</p>
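<p>The trigger condition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the monitoring logic used by the product: the function name <strong>exceeds_threshold</strong> and its parameters are hypothetical, and only the 90% default comes from this topic.</p>

```python
def exceeds_threshold(used_mb, max_mb, threshold_percent=90.0):
    """Return True when direct memory usage is above the threshold.

    used_mb and max_mb are hypothetical inputs: the direct memory in use
    and the maximum direct memory of one DataNode instance.
    """
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    # Compare without dividing so the boundary value is handled exactly.
    return used_mb * 100.0 > threshold_percent * max_mb
```

<p>For example, 950 MB used out of 1000 MB exceeds the default threshold, whereas exactly 900 MB does not.</p>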
</div>
<div class="section" id="ALM-14016__section4695804"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14016__table45595801" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14016__row11217243"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14016__p36181452">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14016__p45016494">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14016__p22457384">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14016__row7108835"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14016__p38944775">14016</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14016__p410170">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14016__p33223847">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14016__section42262236"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14016__table6777076" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14016__row9420492"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14016__p24862365">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14016__p585716">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14016__row684592262710"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14016__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14016__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14016__row47442998"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14016__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14016__p22603169">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14016__row2101933"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14016__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14016__p33468688">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14016__row32782743"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14016__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14016__p3670791">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14016__row33037121"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14016__p58761149">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14016__p62032625">Specifies the threshold for triggering the alarm.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14016__section44815811"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14016__p58586758">If the available direct memory of DataNode instances is insufficient, a memory overflow may occur and the service may break down.</p>
</div>
<div class="section" id="ALM-14016__section689121"><h4 class="sectiontitle">Possible Causes</h4><p id="ALM-14016__p47906980">The direct memory of DataNode instances is overused or the direct memory is inappropriately allocated.</p>
</div>
<div class="section" id="ALM-14016__section6202095"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-14016__p55260146"><strong id="ALM-14016__b2299670893719">Check the direct memory usage.</strong></p>
<ol id="ALM-14016__ol3810088993733"><li id="ALM-14016__li551053093722"><span>On the <strong id="ALM-14016__b12307162712516">Home</strong> page of FusionInsight Manager, choose <strong id="ALM-14016__b104791231155819">O&amp;M</strong> &gt; <strong id="ALM-14016__b99388376587">Alarm</strong> &gt; <strong id="ALM-14016__b16949184112583">Alarms</strong>. On the page that is displayed, click the drop-down list in the row containing <strong id="ALM-14016__b354913541025">ALM-14016 DataNode Direct Memory Usage Exceeds the Threshold</strong>, and view the role name and IP address of the instance for which the alarm is generated in the <strong id="ALM-14016__b797112284316">Location</strong> area.</span></li><li id="ALM-14016__li4959477593722"><span>On the <strong id="ALM-14016__b08961571155">Home</strong> page of FusionInsight Manager, choose <strong id="ALM-14016__b1246224315514">Cluster</strong> &gt; <strong id="ALM-14016__b13675176067">Services</strong> &gt; <strong id="ALM-14016__b19663410065">HDFS</strong>. On the page that is displayed, click the <strong id="ALM-14016__b768517351063">Instance</strong> tab. In the instance list, select <strong id="ALM-14016__b88481413141411">DataNode</strong> (IP address of the instance for which this alarm is generated). 
Click the drop-down list in the upper right corner of the chart, choose <strong id="ALM-14016__b139359324159">Customize</strong> &gt; <strong id="ALM-14016__b6541143741510">Resource</strong>, and select <strong id="ALM-14016__b321901811616">DataNode Memory</strong> to check the direct memory usage.</span></li><li id="ALM-14016__li2614638593722"><span>Check whether the used direct memory of a DataNode instance reaches 90% (default threshold) of the maximum direct memory allocated to it.</span><p><ul class="subitemlist" id="ALM-14016__ul2527477593722"><li id="ALM-14016__li5775381693722">If yes, go to <a href="#ALM-14016__li3399087993722">4</a>.</li><li id="ALM-14016__li4754755193722">If no, go to <a href="#ALM-14016__li5838381193722">8</a>.</li></ul>
</p></li><li id="ALM-14016__li3399087993722"><a name="ALM-14016__li3399087993722"></a><a name="li3399087993722"></a><span>On the <strong id="ALM-14016__b17828194392017">Home</strong> page of FusionInsight Manager, choose <strong id="ALM-14016__b1958765334101524">Cluster</strong> &gt; <strong id="ALM-14016__b786805895101524">Services</strong> &gt; <strong id="ALM-14016__b930780075101524">HDFS</strong>. On the page that is displayed, click the <strong id="ALM-14016__b1087737473101524">Configuration</strong> tab and then the <strong id="ALM-14016__b1838340475101524">All Configurations</strong> sub-tab, and select <strong id="ALM-14016__b1850743234101524">DataNode</strong> &gt; <strong id="ALM-14016__b232007784101524">System</strong>. Check whether <strong id="ALM-14016__b915021312351">-XX:MaxDirectMemorySize</strong> exists in the <strong id="ALM-14016__b279711161358">GC_OPTS</strong> parameter.</span><p><ul id="ALM-14016__ul568612458592"><li id="ALM-14016__li196861458591">If yes, go to <a href="#ALM-14016__li062164310159">5</a>.</li><li id="ALM-14016__li225719536590">If no, go to <a href="#ALM-14016__li111010376180">6</a>.</li></ul>
</p></li><li id="ALM-14016__li062164310159"><a name="ALM-14016__li062164310159"></a><a name="li062164310159"></a><span>Adjust the value of <strong id="ALM-14016__b1529916238424">-XX:MaxDirectMemorySize</strong>.</span><p><ol type="a" id="ALM-14016__ol1046720113425"><li id="ALM-14016__li193881210114215">In <strong id="ALM-14016__b1189412914219">GC_OPTS</strong>, check the value of <strong id="ALM-14016__b99411634201420">-Xmx</strong> and check whether the node memory is sufficient.<div class="note" id="ALM-14016__note15134329433"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14016__p91346220431">You can determine whether the node memory is sufficient based on the actual environment. For example, you can use the following method:</p>
<div class="p" id="ALM-14016__p169471348195217">Log in as user <strong id="ALM-14016__b10467624920">root</strong> to the node (by IP address) where the instance for which the alarm is generated resides, run the <strong id="ALM-14016__b5315482441">free -g</strong> command, and check the value in the <strong id="ALM-14016__b13691412102116">Mem</strong> row of the <strong id="ALM-14016__b1846323222113">free</strong> column. The value indicates the free memory of the node. In the following example, the free memory of the node is 4 GB.<pre class="screen" id="ALM-14016__screen1057163313454">              total        used        <strong id="ALM-14016__b5364184574919">free</strong>      shared  buff/cache   available
Mem:            112          48          <strong id="ALM-14016__b3531447104917"> 4</strong>          10          58          46
......</pre>
</div>
<p id="ALM-14016__p397165165216">If the value of <strong id="ALM-14016__b208692614278">Mem</strong> is at least that of <strong id="ALM-14016__b19667154413241">-Xmx</strong>, the node memory is sufficient. If the value of <strong id="ALM-14016__b125717338273">Mem</strong> is less than that of <strong id="ALM-14016__b425933372711">-Xmx</strong>, the node memory is insufficient.</p>
</div></div>
<ul id="ALM-14016__ul96920471752"><li id="ALM-14016__li1831167161">If the node memory is sufficient, change the value of <strong id="ALM-14016__b14720159162810">-XX:MaxDirectMemorySize</strong> to that of <strong id="ALM-14016__b1827321712812">-Xmx</strong>.</li><li id="ALM-14016__li106926471656">If the node memory is insufficient, increase <strong id="ALM-14016__b1838345082812">-XX:MaxDirectMemorySize</strong> to a value no larger than that of <strong id="ALM-14016__b6531164012295">Mem</strong>.</li></ul>
</li><li id="ALM-14016__li3147163785717">Save the configuration and restart the DataNode instances.</li></ol>
</p></li><li id="ALM-14016__li111010376180"><a name="ALM-14016__li111010376180"></a><a name="li111010376180"></a><span>Check whether <strong id="ALM-14016__b10933444318">ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-14016__ul485114212419"><li id="ALM-14016__li10851162145">If yes, rectify the fault by referring to <strong id="ALM-14016__b8704239143114">ALM-14008 DataNode Heap Memory Usage Exceeds the Threshold</strong>.</li><li id="ALM-14016__li55281914181913">If no, go to <a href="#ALM-14016__li5868287393722">7</a>.</li></ul>
</p></li><li id="ALM-14016__li5868287393722"><a name="ALM-14016__li5868287393722"></a><a name="li5868287393722"></a><span>Check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14016__ul3552166893722"><li id="ALM-14016__li179779193722">If yes, no further action is required.</li><li id="ALM-14016__li1140339293722">If no, go to <a href="#ALM-14016__li5838381193722">8</a>.</li></ul>
</p></li></ol>
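<p>The memory check in step 5 can be sketched as follows. This is an illustrative sketch only: the function names <strong>free_memory_gb</strong> and <strong>max_direct_memory_limit_gb</strong> are hypothetical, the input is assumed to be the text printed by <strong>free -g</strong> in the layout shown in the example above, and the result is only the upper bound described in this procedure, not a tuned value.</p>

```python
def free_memory_gb(free_output):
    """Return the 'free' column of the 'Mem:' row from `free -g` output."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    free_index = header.index("free")
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # fields[0] is the row label, so data columns start at index 1.
            return int(fields[1 + free_index])
    raise ValueError("no 'Mem:' row found in the output")


def max_direct_memory_limit_gb(xmx_gb, free_gb):
    """Apply the rule from step 5 (all values in GB).

    If the free memory is at least -Xmx, use the -Xmx value.
    Otherwise the setting must not exceed the free memory, so the
    free memory is returned as the upper bound.
    """
    if free_gb >= xmx_gb:
        return xmx_gb
    return free_gb


sample = """\
              total        used        free      shared  buff/cache   available
Mem:            112          48           4          10          58          46
"""
free_gb = free_memory_gb(sample)
```

<p>With the sample output, the free memory is 4 GB. If <strong>-Xmx</strong> were 2 GB, the sketch would return 2; if it were 8 GB, the sketch would return 4 as the upper bound.</p>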
<p class="tableheading" id="ALM-14016__p5125956193722"><strong id="ALM-14016__b1435719093739">Collect the fault information.</strong></p>
<ol start="8" id="ALM-14016__ol3579404993743"><li id="ALM-14016__li5838381193722"><a name="ALM-14016__li5838381193722"></a><a name="li5838381193722"></a><span>On FusionInsight Manager, choose <strong id="ALM-14016__b1229613103210">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-14016__b93991311326">Log</strong> &gt; <strong id="ALM-14016__b154071323213">Download</strong>.</span></li><li id="ALM-14016__li5569225893722"><span>Expand the <strong id="ALM-14016__b4436203320">Service</strong> drop-down list, and select <strong id="ALM-14016__b55252023211">DataNode</strong> for the target cluster.</span></li><li id="ALM-14016__li3146827893722"><span>Click <span><img id="ALM-14016__image104601319175315" src="en-us_image_0263895680.png"></span> in the upper right corner, and set <strong id="ALM-14016__b8788359322">Start Date</strong> and <strong id="ALM-14016__b9807354323">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14016__b581193553214">Download</strong>.</span></li><li id="ALM-14016__li6590256993722"><span>Contact <span id="ALM-14016__text126301214142412">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
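<p>The log collection window described above can be computed as in the following sketch. The function name <strong>log_collection_window</strong> and the sample alarm time are hypothetical; only the 10-minute margin comes from this procedure.</p>

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering the margin before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, used only for illustration.
start, end = log_collection_window(datetime(2022, 12, 13, 12, 3, 34))
```

<p>The two values can then be entered as <strong>Start Date</strong> and <strong>End Date</strong>.</p>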
</div>
<div class="section" id="ALM-14016__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14016__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-14016__section55818863"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14016__p46093759">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>