doc-exports/docs/mrs/umn/ALM-14002.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-14002"></a>
<h1 class="topictitle1">ALM-14002 DataNode Disk Usage Exceeds the Threshold</h1>
<div id="body65288654"><div class="section" id="ALM-14002__section7524158"><h4 class="sectiontitle">Description</h4><p id="ALM-14002__p53889626">The system checks the DataNode disk usage every 30 seconds and compares the actual disk usage with the threshold. A default threshold range is provided for the DataNode disk usage. This alarm is generated when the DataNode disk usage exceeds the threshold.</p>
<p id="ALM-14002__p15244590">To change the threshold, choose <strong id="ALM-14002__b3713133193110">O&amp;M</strong> &gt; <strong id="ALM-14002__b77135323116">Alarm</strong> &gt; <strong id="ALM-14002__b4713143153112">Thresholds</strong> &gt; <em id="ALM-14002__i1671416383119">Name of the desired cluster</em> &gt; <strong id="ALM-14002__b4714834319">HDFS</strong>.</p>
<p id="ALM-14002__p2983582">If <strong id="ALM-14002__b459114293216">Trigger Count</strong> is <strong id="ALM-14002__b257216499220">1</strong>, this alarm is cleared when the DataNode disk usage is less than or equal to the threshold. If <strong id="ALM-14002__b1097912391828">Trigger Count</strong> is greater than <strong id="ALM-14002__b19876751128">1</strong>, this alarm is cleared when the DataNode disk usage is less than or equal to 80% of the threshold.</p>
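<p>The clearing rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the function names <strong>clear_level</strong> and <strong>is_cleared</strong> are hypothetical, and the threshold of 80 used in the usage note is only an example value.</p>

```python
def clear_level(threshold: float, trigger_count: int) -> float:
    """Return the disk usage (in percent) at or below which the alarm clears.

    Hypothetical helper: it only mirrors the rule stated in the Description.
    """
    if trigger_count < 1:
        raise ValueError("trigger_count must be at least 1")
    # Trigger Count equal to 1: the alarm clears at the threshold itself.
    # Trigger Count greater than 1: the alarm clears at 80% of the threshold.
    return threshold if trigger_count == 1 else 0.8 * threshold


def is_cleared(usage: float, threshold: float, trigger_count: int) -> bool:
    """True when the current usage is at or below the clearing level."""
    return usage <= clear_level(threshold, trigger_count)
```

<p>For example, with an assumed threshold of 80 and a Trigger Count of 3, <strong>is_cleared(70.0, 80.0, 3)</strong> is false because the clearing level is 64, whereas the same usage clears the alarm when Trigger Count is 1.</p>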
</div>
<div class="section" id="ALM-14002__section608563"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14002__table40343619" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14002__row45326725"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14002__p47586108">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14002__p29269531">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14002__p22021835">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14002__row38938228"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14002__p66988733">14002</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14002__p57378254">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14002__p17126992">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14002__section5477075"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14002__table45109098" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14002__row29250257"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14002__p20460612">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14002__p46696884">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14002__row3352617113212"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14002__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14002__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14002__row24351240"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14002__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14002__p49391998">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14002__row41874799"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14002__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14002__p63982358">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14002__row38970312"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14002__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14002__p66558579">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14002__row62156303"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14002__p1495784">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14002__p54049714">Specifies the threshold for triggering the alarm.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14002__section49293677"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14002__p15950741">Insufficient disk space on a DataNode affects data writes to HDFS.</p>
</div>
<div class="section" id="ALM-14002__section40989916"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14002__ul16941624"><li id="ALM-14002__li18256889">The disk space configured for the HDFS cluster is insufficient.</li><li id="ALM-14002__li30094275">Data skew occurs among DataNodes.</li></ul>
</div>
<div class="section" id="ALM-14002__section33364928"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-14002__p21717215"><strong id="ALM-14002__b1591882453713">Check whether the cluster disk capacity is full.</strong></p>
<ol id="ALM-14002__ol7131397162757"><li id="ALM-14002__li66181733162749"><span>On FusionInsight Manager, choose <strong id="ALM-14002__b135365113380">O&amp;M</strong> &gt; <strong id="ALM-14002__b42161938384">Alarm</strong> &gt; <strong id="ALM-14002__b564555133819">Alarms</strong>, and check whether the <strong id="ALM-14002__b125811021183814">ALM-14001 HDFS Disk Usage Exceeds the Threshold</strong> alarm exists.</span><p><ul class="subitemlist" id="ALM-14002__ul52092768162749"><li id="ALM-14002__li36249487162749">If yes, go to <a href="#ALM-14002__li48847933162749">2</a>.</li><li id="ALM-14002__li50527328162749">If no, go to <a href="#ALM-14002__li49504103162749">4</a>.</li></ul>
</p></li><li id="ALM-14002__li48847933162749"><a name="ALM-14002__li48847933162749"></a><a name="li48847933162749"></a><span>Handle the alarm by following the instructions in <strong id="ALM-14002__b1932683563913">ALM-14001 HDFS Disk Usage Exceeds the Threshold</strong> and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14002__ul65079871162749"><li id="ALM-14002__li62319165162749">If yes, go to <a href="#ALM-14002__li5500455162749">3</a>.</li><li id="ALM-14002__li14687637162749">If no, go to <a href="#ALM-14002__li17443443162749">11</a>.</li></ul>
</p></li><li id="ALM-14002__li5500455162749"><a name="ALM-14002__li5500455162749"></a><a name="li5500455162749"></a><span>Choose <strong id="ALM-14002__b112850332244918">O&amp;M</strong> &gt; <strong id="ALM-14002__b83813856244918">Alarm</strong> &gt; <strong id="ALM-14002__b75605308744918">Alarms</strong> and check whether this alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14002__ul46464158162749"><li id="ALM-14002__li36978214162749">If yes, no further action is required.</li><li id="ALM-14002__li42445386162749">If no, go to <a href="#ALM-14002__li49504103162749">4</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14002__p15524242162749"><strong id="ALM-14002__b144922016144013">Check the balance status of DataNodes.</strong></p>
<ol start="4" id="ALM-14002__ol1905264162814"><li id="ALM-14002__li49504103162749"><a name="ALM-14002__li49504103162749"></a><a name="li49504103162749"></a><span>On FusionInsight Manager, choose <strong id="ALM-14002__b4413143914405">Hosts</strong>. Check whether each rack holds roughly the same number of DataNodes. If the numbers differ significantly, reassign DataNodes to racks so that the racks are evenly populated, and then restart the HDFS service for the change to take effect.</span></li><li id="ALM-14002__li42883745162749"><span>Choose <strong id="ALM-14002__b153505750844918">Cluster</strong> &gt; <em id="ALM-14002__i66804565844918">Name of the desired cluster</em> &gt; <strong id="ALM-14002__b43031791244918">Services</strong> &gt; <strong id="ALM-14002__b214265994044918">HDFS</strong>.</span></li><li id="ALM-14002__li50409392162749"><span>In the <strong id="ALM-14002__b1084313433427">Basic Information</strong> area, click <strong id="ALM-14002__b619814714217">NameNode(Active)</strong>. The HDFS web UI is displayed.</span><p><div class="note" id="ALM-14002__note184603141102"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14002__en-us_topic_0193189480_en-us_topic_0070539288_p3460314151015">By default, the <strong id="ALM-14002__en-us_topic_0193189480_en-us_topic_0070539288_b56516321216">admin</strong> user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.</p>
</div></div>
</p></li><li id="ALM-14002__li27870678162749"><span>In the <strong id="ALM-14002__b17446115814434">Summary</strong> area of the HDFS web UI, check whether the value of <strong id="ALM-14002__b121814924416">Max</strong> is 10% greater than that of <strong id="ALM-14002__b1524118156443">Median</strong> in <strong id="ALM-14002__b3945518124420">DataNodes usages</strong>.</span><p><ul class="subitemlist" id="ALM-14002__ul10553282162749"><li id="ALM-14002__li56628950162749">If yes, go to <a href="#ALM-14002__li25048823162749">8</a>.</li><li id="ALM-14002__li23542208162749">If no, go to <a href="#ALM-14002__li17443443162749">11</a>.</li></ul>
</p></li><li id="ALM-14002__li25048823162749"><a name="ALM-14002__li25048823162749"></a><a name="li25048823162749"></a><span>Balance skewed data in the cluster. Log in to the <span id="ALM-14002__text1758235613314">MRS</span> client as user <strong id="ALM-14002__b212807715544918">root</strong>. <span id="ALM-14002__text85258205227"></span> If the cluster is in normal mode, run the <strong id="ALM-14002__b197905474644918">su - omm</strong> command to switch to user <strong id="ALM-14002__b134236484744918">omm</strong>. Run the <strong id="ALM-14002__b212203586644918">cd</strong> command to go to the client installation directory, and then run the <strong id="ALM-14002__b21132896244918">source bigdata_env</strong> command to load the environment variables. If the cluster is in security mode, run <strong id="ALM-14002__b54978985162749">kinit hdfs</strong> for security authentication and enter the password as prompted. Obtain the password from the MRS cluster administrator.</span></li><li id="ALM-14002__li35086403237"><span>Run the following command to balance data distribution:</span><p><p id="ALM-14002__p1669404182318"><strong id="ALM-14002__b24112821162749">hdfs balancer -threshold 10</strong></p>
</p></li><li id="ALM-14002__li1938160162749"><span>Wait several minutes and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14002__ul56362233162749"><li id="ALM-14002__li6981518162749">If yes, no further action is required.</li><li id="ALM-14002__li28632091162749">If no, go to <a href="#ALM-14002__li17443443162749">11</a>.</li></ul>
</p></li></ol>
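<p>The skew check in this part of the procedure can be sketched as follows. This is a minimal illustration, not a tool shipped with the product. It assumes that the usage values are read manually from <strong>DataNodes usages</strong> on the HDFS web UI, and that "10% greater" means the maximum exceeds the median by more than 10 percentage points, which matches how <strong>hdfs balancer -threshold 10</strong> expresses its tolerance. The names <strong>is_skewed</strong> and <strong>next_step</strong> are hypothetical.</p>

```python
from statistics import median

# Command taken from the procedure; this sketch only records it and does not run it.
BALANCER_COMMAND = ["hdfs", "balancer", "-threshold", "10"]


def is_skewed(usages_percent, margin=10.0):
    """True when the highest DataNode usage exceeds the median usage
    by more than `margin` percentage points (assumed interpretation)."""
    if not usages_percent:
        raise ValueError("need at least one DataNode usage value")
    return max(usages_percent) - median(usages_percent) > margin


def next_step(usages_percent):
    """Return the procedure step to continue with: 8 (balance the data)
    when the usage is skewed, otherwise 11 (collect fault information)."""
    return 8 if is_skewed(usages_percent) else 11
```

<p>For example, usages of 40, 42, and 75 percent give a median of 42 and a maximum of 75, so <strong>next_step</strong> returns 8 and the balancer command applies. Usages of 40, 42, and 45 percent return 11.</p>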
<p class="tableheading" id="ALM-14002__p37498053162749"><strong id="ALM-14002__b14085170162821">Collect the fault information.</strong></p>
<ol start="11" id="ALM-14002__ol42440289162824"><li id="ALM-14002__li17443443162749"><a name="ALM-14002__li17443443162749"></a><a name="li17443443162749"></a><span>On FusionInsight Manager, choose <strong id="ALM-14002__b1264585264811">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-14002__b20645165213489">Log</strong> &gt; <strong id="ALM-14002__b19645175218482">Download</strong>.</span></li><li id="ALM-14002__li22773263162749"><span>Expand the drop-down list next to the <strong id="ALM-14002__b9758114392017">Service</strong> field. In the <strong id="ALM-14002__b5764943182015">Services</strong> dialog box that is displayed, select <strong id="ALM-14002__b4764164314209">HDFS</strong> for the target cluster.</span></li><li id="ALM-14002__li3632782162749"><span>Click <span><img id="ALM-14002__image154963213496" src="en-us_image_0263895382.png"></span> in the upper right corner, and set <strong id="ALM-14002__b104977294914">Start Date</strong> and <strong id="ALM-14002__b15497112114918">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14002__b12497526497">Download</strong>.</span></li><li id="ALM-14002__li25819901162749"><span>Contact <span id="ALM-14002__text127035819491">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
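<p>The log collection window can be worked out as in the following minimal sketch. The function name <strong>log_window</strong> and the sample timestamp are assumptions for illustration. The resulting values are entered manually as <strong>Start Date</strong> and <strong>End Date</strong>.</p>

```python
from datetime import datetime, timedelta


def log_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering `margin_minutes` on each side
    of the alarm generation time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, used only as an example.
start, end = log_window(datetime(2022, 12, 13, 12, 3, 34))
print(start.isoformat(), end.isoformat())
```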
</div>
<div class="section" id="ALM-14002__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14002__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-14002__section31848898"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14002__p9900716">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>