Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-14021"></a>
<h1 class="topictitle1">ALM-14021 NameNode Average RPC Processing Time Exceeds the Threshold</h1>
<div id="body1505292051690"><div class="section" id="ALM-14021__section62084932172449"><h4 class="sectiontitle">Description</h4><p id="ALM-14021__p62823563172449">The system checks the average RPC processing time of the NameNode every 30 seconds and compares it with the threshold (default value: 100 ms). This alarm is generated when the average RPC processing time exceeds the threshold for a configured number of consecutive checks (10 by default).</p>
<p id="ALM-14021__p30132727172449">To change the threshold, choose <strong id="ALM-14021__b17605111213495">O&M > Alarm > Thresholds ></strong> <em id="ALM-14021__i16064120493">Name of the desired cluster</em> > <strong id="ALM-14021__en-us_topic_0070543655_b3167375">HDFS</strong>.</p>
<p id="ALM-14021__p31216245172449">When the <strong id="ALM-14021__b48421890111935">Trigger Count</strong> is 1, this alarm is cleared when the average RPC processing time of NameNode is less than or equal to the threshold. When the <strong id="ALM-14021__b145781733153810">Trigger Count</strong> is greater than 1, this alarm is cleared when the average RPC processing time of NameNode is less than or equal to 90% of the threshold.</p>
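<p>The trigger and clearing rules above can be sketched in Python as a small state machine. This is an illustrative sketch only, not FusionInsight Manager code: the class <code>RpcAlarmState</code> and its fields are hypothetical names, and the sketch assumes that a single sample at or below the clearing level is enough to clear the alarm.</p>

```python
from dataclasses import dataclass


@dataclass
class RpcAlarmState:
    """Hypothetical model of the ALM-14021 trigger/clear rules."""
    threshold_ms: float = 100.0   # default threshold from the description
    trigger_count: int = 10       # default number of consecutive checks
    consecutive_over: int = 0     # checks in a row above the threshold
    active: bool = False          # whether the alarm is currently raised

    def clear_level_ms(self) -> float:
        # Trigger Count 1: clear at the threshold itself.
        # Trigger Count > 1: clear at 90% of the threshold.
        if self.trigger_count == 1:
            return self.threshold_ms
        return 0.9 * self.threshold_ms

    def observe(self, avg_rpc_ms: float) -> bool:
        """Feed one 30-second sample and return whether the alarm is active."""
        if avg_rpc_ms > self.threshold_ms:
            self.consecutive_over += 1
        else:
            self.consecutive_over = 0

        if not self.active and self.consecutive_over >= self.trigger_count:
            self.active = True
        elif self.active and avg_rpc_ms <= self.clear_level_ms():
            # Assumption: one sample at or below the clearing level clears it.
            self.active = False
        return self.active
```

<p>With the defaults, ten consecutive samples above 100 ms raise the alarm, and a later sample of 95 ms does not clear it because the clearing level is 90 ms.</p>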
</div>
<div class="section" id="ALM-14021__section6738097172449"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14021__table8915002172449" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14021__row46056374172449"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14021__p39578800172449">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14021__p51766251172449">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14021__p32316815172449">Automatically Cleared</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14021__row416322172449"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14021__p33722159172449">14021</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14021__p47140348172449">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14021__p60271825172449">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14021__section50179671172449"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14021__table38021557172449" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14021__row7932414172449"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14021__p38545813172449">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14021__p35203177172449">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14021__row10723314334"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14021__row32885089172449"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p46446566172449">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p4075487172449">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14021__row36679387172449"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p18240350172449">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p1073353172449">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14021__row9660181172449"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p44277233172449">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p29686088172449">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14021__row65848206172449"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p59271415193920">NameServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p50327034172449">Specifies the NameService for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14021__row50290122172449"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14021__p46968100172449">Trigger condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14021__p46319746172449">Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14021__section60911949172449"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14021__p34920812172449">The NameNode cannot process RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNodes in a timely manner. As a result, services that access HDFS run slowly, or the HDFS service becomes unavailable.</p>
</div>
<div class="section" id="ALM-14021__section45851855172449"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14021__ul23012744172449"><li id="ALM-14021__li5788104172449">The CPU performance of the NameNode node is insufficient, so the NameNode cannot process messages in a timely manner.</li><li id="ALM-14021__li52092937172449">The configured NameNode memory is too small, so frequent full garbage collection (GC) causes the JVM to pause.</li></ul>
<ul id="ALM-14021__ul59243221172449"><li id="ALM-14021__li63426949172449">NameNode parameters are not configured properly, so NameNode cannot make full use of system performance.</li></ul>
</div>
<div class="section" id="ALM-14021__section33971635172449"><h4 class="sectiontitle">Procedure</h4><p id="ALM-14021__p239061172449"><strong id="ALM-14021__b2151556172449">Obtain alarm information.</strong></p>
<ol id="ALM-14021__ol64359392174922"><li id="ALM-14021__li40058311172449"><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b24980487172449">O&M > Alarm > Alarms</strong>. In the alarm list, click the alarm.</span></li><li id="ALM-14021__li23497796172449"><span>Check the alarm. Obtain the host name of the NameNode node involved in this alarm from the <strong id="ALM-14021__b24273313172449">HostName</strong> information of <strong id="ALM-14021__b17133227172449">Location</strong>. Then obtain the name of the NameService involved in this alarm from the <strong id="ALM-14021__b19981318172449">NameServiceName</strong> information of <strong id="ALM-14021__b45614140172449">Location</strong>.</span></li></ol>
<p id="ALM-14021__p32364400174924"><strong id="ALM-14021__b2846005174925">Check whether the threshold is too small.</strong></p>
<ol start="3" id="ALM-14021__ol17734475174928"><li id="ALM-14021__li33820426172449"><span>Check the status of the services that depend on HDFS. Check whether the services run slowly or task execution times out.</span><p><ul id="ALM-14021__ul35948386172449"><li id="ALM-14021__li55100025172449">If yes, go to <a href="#ALM-14021__li48203297172449">8</a>.</li><li id="ALM-14021__li36818203172449">If no, go to <a href="#ALM-14021__li29484482172449">4</a>.</li></ul>
</p></li><li id="ALM-14021__li29484482172449"><a name="ALM-14021__li29484482172449"></a><a name="li29484482172449"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b131013471417">Cluster > </strong><em id="ALM-14021__i153121047184114">Name of the desired cluster</em><strong id="ALM-14021__b113111647194119"> > Services</strong> > <strong id="ALM-14021__b39432870172449">HDFS</strong>. Click the drop-down menu in the upper right corner of <strong id="ALM-14021__b3273144141318">Chart</strong>, choose <strong id="ALM-14021__b7246166191312">Customize</strong> > <strong id="ALM-14021__b15702441192211">RPC</strong>, select <strong id="ALM-14021__b23968803172449">Average Time of Active NameNode RPC Processing</strong>, and click <strong id="ALM-14021__b14392643172449">OK</strong>.</span></li><li id="ALM-14021__li62424925172449"><span>On the <strong id="ALM-14021__b24953420172449">Average Time of Active NameNode RPC Processing</strong> monitoring page, obtain the value for the NameService involved in this alarm.</span></li><li id="ALM-14021__li23254192172449"><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b146122213493">O&M > Alarm > Thresholds > </strong><em id="ALM-14021__i11464822194911">Name of the desired cluster</em><strong id="ALM-14021__b1746262204915"> ></strong> <strong id="ALM-14021__b3841158527">HDFS</strong>. Locate <strong id="ALM-14021__b40872412172449">Average Time of Active NameNode RPC Processing</strong> and click <strong id="ALM-14021__b96499717414">Modify</strong> in the <strong id="ALM-14021__b22331058172449">Operation</strong> column of the default rule. The <strong id="ALM-14021__b978719451349">Modify Rule</strong> page is displayed. Change <strong id="ALM-14021__b63985310172449">Threshold</strong> to 150% of the peak value observed from one day before to one day after the alarm was generated. Click <strong id="ALM-14021__b38996878172449">OK</strong> to save the new threshold.</span></li><li id="ALM-14021__li15427589172449"><span>Wait for 5 minutes and then check whether the alarm is automatically cleared.</span><p><ul id="ALM-14021__ul4630579172449"><li id="ALM-14021__li41675217172449">If yes, no further action is required.</li><li id="ALM-14021__li39532641172449">If no, go to <a href="#ALM-14021__li48203297172449">8</a>.</li></ul>
</p></li></ol>
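<p>The threshold adjustment in the steps above is simple arithmetic: take the peak of the monitored value around the alarm time and multiply it by 1.5. A minimal Python sketch follows. It assumes the monitoring values have been copied out by hand as (timestamp, milliseconds) pairs; the function name <code>suggest_threshold_ms</code> and its parameters are hypothetical and are not part of FusionInsight Manager.</p>

```python
from datetime import datetime, timedelta


def suggest_threshold_ms(samples, alarm_time,
                         window=timedelta(days=1), factor=1.5):
    """Return factor * peak of the samples within +/- window of alarm_time.

    samples: iterable of (datetime, avg_rpc_ms) pairs (hypothetical input).
    """
    in_window = [value for ts, value in samples
                 if abs(ts - alarm_time) <= window]
    if not in_window:
        raise ValueError("no samples within the window around the alarm")
    return factor * max(in_window)


# Example with made-up values: the 300 ms sample is outside the window.
alarm = datetime(2024, 1, 10, 12, 0)
samples = [
    (datetime(2024, 1, 9, 18, 0), 80.0),
    (datetime(2024, 1, 10, 11, 30), 120.0),
    (datetime(2024, 1, 13, 9, 0), 300.0),
]
suggested = suggest_threshold_ms(samples, alarm)
```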
<p id="ALM-14021__p33563136173532"><strong id="ALM-14021__b15572644173533">Check whether the CPU performance of the NameNode node is sufficient.</strong></p>
<ol start="8" id="ALM-14021__ol704880617369"><li id="ALM-14021__li48203297172449"><a name="ALM-14021__li48203297172449"></a><a name="li48203297172449"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b31176492172449">O&M > Alarm > Alarms</strong> and check whether <strong id="ALM-14021__b12152975172449">ALM-12016 CPU Usage Exceeds the Threshold</strong> is generated for the NameNode node.</span><p><ul id="ALM-14021__ul42267915172449"><li id="ALM-14021__li44866922172449">If yes, go to <a href="#ALM-14021__li23155373172449">9</a>.</li><li id="ALM-14021__li25969508172449">If no, go to <a href="#ALM-14021__li29576569172449">11</a>.</li></ul>
</p></li><li id="ALM-14021__li23155373172449"><a name="ALM-14021__li23155373172449"></a><a name="li23155373172449"></a><span>Handle <strong id="ALM-14021__b7071765172449">ALM-12016 CPU Usage Exceeds the Threshold</strong> by taking recommended actions.</span></li><li id="ALM-14021__li63645886172449"><span>Wait for 10 minutes and check whether alarm 14021 is automatically cleared.</span><p><ul id="ALM-14021__ul35942065172449"><li id="ALM-14021__li55043136172449">If yes, no further action is required.</li><li id="ALM-14021__li25626178172449">If no, go to <a href="#ALM-14021__li29576569172449">11</a>.</li></ul>
</p></li></ol>
<p id="ALM-14021__p33032850173615"><strong id="ALM-14021__b45503751173618">Check whether the memory of the NameNode node is too small.</strong></p>
<ol start="11" id="ALM-14021__ol5163544717386"><li id="ALM-14021__li29576569172449"><a name="ALM-14021__li29576569172449"></a><a name="li29576569172449"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b1553819427818">O&M > Alarm > Alarms</strong> and check whether <strong id="ALM-14021__b46891852172449">ALM-14007 HDFS NameNode Heap Memory Usage Exceeds the Threshold</strong> is generated for the NameNode node.</span><p><ul id="ALM-14021__ul19373492172449"><li id="ALM-14021__li40143705172449">If yes, go to <a href="#ALM-14021__li26363673172449">12</a>.</li><li id="ALM-14021__li5296504172449">If no, go to <a href="#ALM-14021__li41096175172449">14</a>.</li></ul>
</p></li><li id="ALM-14021__li26363673172449"><a name="ALM-14021__li26363673172449"></a><a name="li26363673172449"></a><span>Handle <strong id="ALM-14021__b35946465172449">ALM-14007 HDFS NameNode Heap Memory Usage Exceeds the Threshold</strong> by taking recommended actions.</span></li><li id="ALM-14021__li55082734172449"><span>Wait for 10 minutes and check whether alarm 14021 is automatically cleared.</span><p><ul id="ALM-14021__ul25982559172449"><li id="ALM-14021__li32516447172449">If yes, no further action is required.</li><li id="ALM-14021__li24212574172449">If no, go to <a href="#ALM-14021__li41096175172449">14</a>.</li></ul>
</p></li></ol>
<p id="ALM-14021__p19714972173814"><strong id="ALM-14021__b24903380173815">Check whether NameNode parameters are configured properly.</strong></p>
<ol start="14" id="ALM-14021__ol10807579173828"><li id="ALM-14021__li41096175172449"><a name="ALM-14021__li41096175172449"></a><a name="li41096175172449"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b19287183210421">Cluster > </strong><em id="ALM-14021__i182895326427">Name of the desired cluster</em><strong id="ALM-14021__b728710321421"> > Services</strong> > <strong id="ALM-14021__b2162943793121">HDFS</strong> > <strong id="ALM-14021__b6593994694053">Configurations</strong> > <strong id="ALM-14021__b715398193121">All</strong> <strong id="ALM-14021__b6816162115232">Configurations</strong>. Search for the parameter <strong id="ALM-14021__b55703035172449">dfs.namenode.handler.count</strong> and view its value. If the value is less than 128, change it to <strong id="ALM-14021__b31565272172449">128</strong>. If the value is greater than 128 but less than 192, change it to <strong id="ALM-14021__b15652000172449">192</strong>.</span></li><li id="ALM-14021__li6650274172449"><span>Search for the parameter <strong id="ALM-14021__b59852473172449">ipc.server.read.threadpool.size</strong> and view its value. If the value is less than 5, change it to <strong id="ALM-14021__b1801349172449">5</strong>.</span></li><li id="ALM-14021__li16212143172449"><span>Click <strong id="ALM-14021__b11691560172449">Save</strong> and click <strong id="ALM-14021__b7492275172449">OK</strong>.</span></li><li id="ALM-14021__li321615172449"><span>On the <strong id="ALM-14021__b2894543172449">Instance</strong> page of HDFS, select the standby NameNode of the NameService involved in this alarm and choose <strong id="ALM-14021__b26050891172449">More</strong> > <strong id="ALM-14021__b33131427172449">Restart Instance</strong>. Enter the password and click <strong id="ALM-14021__b29747394172449">OK</strong>. Wait until the standby NameNode has started.</span></li><li id="ALM-14021__li66399960172449"><span>On the <strong id="ALM-14021__b60728734172449">Instance</strong> page of HDFS, select the active NameNode of the NameService involved in this alarm and choose <strong id="ALM-14021__b9687702172449">More</strong> > <strong id="ALM-14021__b20080455172449">Restart Instance</strong>. Enter the password and click <strong id="ALM-14021__b46506372172449">OK</strong>. Wait until the active NameNode has started.</span></li><li id="ALM-14021__li15904164172449"><span>Wait for 1 hour and then check whether the alarm is automatically cleared.</span><p><ul id="ALM-14021__ul8919754172449"><li id="ALM-14021__li13168926172449">If yes, no further action is required.</li><li id="ALM-14021__li51411473172449">If no, go to <a href="#ALM-14021__li59520454172449">20</a>.</li></ul>
</p></li></ol>
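<p>The parameter adjustments in the steps above follow a fixed rule, which can be written down as a short Python helper for checking values before editing them in FusionInsight Manager. This is only a sketch of the rule as stated in this procedure: the function name <code>recommend_namenode_settings</code> is hypothetical, and values of 192 or more for <code>dfs.namenode.handler.count</code> are left unchanged because the procedure does not mention them.</p>

```python
def recommend_namenode_settings(handler_count: int,
                                read_threadpool_size: int) -> dict:
    """Apply the adjustment rule from this procedure to the current values."""
    # dfs.namenode.handler.count: raise to 128, or to 192 if already above 128.
    if handler_count <= 128:
        new_handler_count = 128
    elif handler_count < 192:
        new_handler_count = 192
    else:
        new_handler_count = handler_count  # not covered by the procedure

    # ipc.server.read.threadpool.size: raise to at least 5.
    new_read_threadpool_size = max(read_threadpool_size, 5)

    return {
        "dfs.namenode.handler.count": new_handler_count,
        "ipc.server.read.threadpool.size": new_read_threadpool_size,
    }
```

<p>The new values take effect only after the standby and active NameNode instances are restarted as described above.</p>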
<p id="ALM-14021__p445312173834"><strong id="ALM-14021__b4653260173835">Collect fault information.</strong></p>
<ol start="20" id="ALM-14021__ol13071154173840"><li id="ALM-14021__li59520454172449"><a name="ALM-14021__li59520454172449"></a><a name="li59520454172449"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14021__b39977366113627">O&M</strong> > <strong id="ALM-14021__b24251979113627">Log > Download</strong>.</span></li><li id="ALM-14021__li38085074172449"><span>In <strong id="ALM-14021__b7221346172449">Service</strong>, select the following service for the required cluster:</span><p><ul id="ALM-14021__ul48058126172449"><li id="ALM-14021__li29869957172449">HDFS</li></ul>
</p></li><li id="ALM-14021__li1145664103113"><span>Click <span><img id="ALM-14021__image1945644173117" src="en-us_image_0269417365.png"></span> in the upper right corner, and set <strong id="ALM-14021__b6456941173117">Start Date</strong> and <strong id="ALM-14021__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14021__b13456164113319">Download</strong>.</span></li><li id="ALM-14021__li35946735172449"><span>Contact the <span id="ALM-14021__text4614151421417">O&M personnel</span> and send the collected logs.</span></li></ol>
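<p>The log collection window in the steps above can be computed from the alarm generation time. The following Python sketch shows the calculation; the function name <code>log_collection_window</code> is hypothetical, and the resulting values still need to be entered manually as <strong>Start Date</strong> and <strong>End Date</strong>.</p>

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time, margin=timedelta(minutes=10)):
    """Return (start, end): margin before and after the alarm generation time."""
    return alarm_time - margin, alarm_time + margin


# Example with a made-up alarm generation time.
start, end = log_collection_window(datetime(2024, 1, 10, 12, 0))
```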
</div>
<div class="section" id="ALM-14021__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14021__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-14021__section55085166172449"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14021__p32713425172449">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>