Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-14022"></a><a name="ALM-14022"></a>
<h1 class="topictitle1">ALM-14022 NameNode Average RPC Queuing Time Exceeds the Threshold</h1>
<div id="body1505292051690"><div class="section" id="ALM-14022__section750643617318"><h4 class="sectiontitle">Description</h4><p id="ALM-14022__p404154917318">The system checks the average RPC queuing time of NameNode every 30 seconds and compares it with the threshold (default value: 200 ms). This alarm is generated when the average RPC queuing time exceeds the threshold for a configured number of consecutive checks (10 by default).</p>
<p id="ALM-14022__p5893001417318">You can choose <strong id="ALM-14022__en-us_topic_0070543655_b35169757">O&M > Alarm > Thresholds ></strong> <em id="ALM-14022__i175981233194118">Name of the desired cluster</em> > <strong id="ALM-14022__en-us_topic_0070543655_b3167375">HDFS</strong> to change the threshold.</p>
<p id="ALM-14022__p2960558417318">When the <strong id="ALM-14022__b48421890111935">Trigger Count</strong> is 1, this alarm is cleared when the average RPC queuing time of NameNode is less than or equal to the threshold. When the <strong id="ALM-14022__b348403763817">Trigger Count</strong> is greater than 1, this alarm is cleared when the average RPC queuing time of NameNode is less than or equal to 90% of the threshold.</p>
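<p>The trigger and clear rules described above can be sketched as a small state machine. This is only an illustrative model, not the actual FusionInsight Manager implementation: the class name <code>RpcQueueAlarm</code> and its fields are hypothetical, and the sketch assumes that a sample at or below the threshold resets the consecutive-check counter.</p>

```python
from dataclasses import dataclass


@dataclass
class RpcQueueAlarm:
    """Hypothetical model of the ALM-14022 trigger/clear rule."""

    threshold_ms: float = 200.0  # default threshold from the text
    trigger_count: int = 10      # default number of consecutive checks
    consecutive: int = 0         # checks in a row above the threshold
    active: bool = False         # whether the alarm is currently raised

    def clear_level_ms(self) -> float:
        # Trigger Count 1: clear at the threshold itself.
        # Trigger Count > 1: clear at 90% of the threshold.
        if self.trigger_count == 1:
            return self.threshold_ms
        return 0.9 * self.threshold_ms

    def observe(self, avg_queue_ms: float) -> bool:
        """Feed one sample (one per 30-second check); return the alarm state."""
        if avg_queue_ms > self.threshold_ms:
            self.consecutive += 1
            if self.consecutive >= self.trigger_count:
                self.active = True
        else:
            # Assumption: any sample at or below the threshold resets the count.
            self.consecutive = 0
            if self.active and avg_queue_ms <= self.clear_level_ms():
                self.active = False
        return self.active
```

<p>With the default settings, ten consecutive samples above 200 ms raise the alarm, and it stays raised until a sample falls to 180 ms or lower.</p>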
</div>
<div class="section" id="ALM-14022__section6512366917318"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14022__table4052584917318" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14022__row5134743017318"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14022__p6550114717318">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14022__p399270517318">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14022__p5497369517318">Automatically Cleared</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14022__row2368435017318"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14022__p3938423017318">14022</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14022__p3600604617318">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14022__p3080859017318">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14022__section1246787117318"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14022__table326463217318" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14022__row3665211117318"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14022__p1603098517318">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14022__p2344139117318">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14022__row1865111239338"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14022__row1970455517318"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p5256509017318">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p2991391817318">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14022__row78981417318"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p6397494817318">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p1458826617318">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14022__row6418553417318"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p3164575617318">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p1316944717318">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14022__row5141616017318"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p1372617291401">NameServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p5227992217318">Specifies the NameService for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14022__row75725017318"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14022__p6133726217318">Trigger condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14022__p226235117318">Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14022__section4903278117318"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14022__p1223233517318">NameNode cannot process RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNodes in a timely manner. As a result, services that access HDFS run slowly, or the HDFS service becomes unavailable.</p>
</div>
<div class="section" id="ALM-14022__section4298215817318"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14022__ul5900273717318"><li id="ALM-14022__li6126259317318">The CPU performance of the NameNode nodes is insufficient, so NameNode cannot process messages in a timely manner.</li><li id="ALM-14022__li1449243017318">The configured NameNode memory is too small, and the JVM pauses frequently because of full garbage collection (full GC).</li><li id="ALM-14022__li6332300917318">NameNode parameters are not configured properly, so NameNode cannot make full use of system performance.</li><li id="ALM-14022__li3303617117318">The volume of services that access HDFS is too large, so NameNode is overloaded.</li></ul>
</div>
<div class="section" id="ALM-14022__section5868417617318"><h4 class="sectiontitle">Procedure</h4><p id="ALM-14022__p5579783017318"><strong id="ALM-14022__b3241842717318">Obtain alarm information.</strong></p>
<ol id="ALM-14022__ol10712061174842"><li id="ALM-14022__li864690817318"><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b24980487172449">O&M > Alarm > Alarms</strong>. In the alarm list, click the alarm.</span></li><li id="ALM-14022__li2931097917318"><span>Check the alarm. Obtain the alarm generation time from <strong id="ALM-14022__b2537906117318">Gen</strong><strong id="ALM-14022__b1880921922319">erated</strong>. Obtain the host name of the NameNode node involved in this alarm from the <strong id="ALM-14022__b2708496317318">HostName </strong>information of <strong id="ALM-14022__b4243807617318">Location</strong>. Then obtain the name of the NameService node involved in this alarm from the <strong id="ALM-14022__b4639837117318">NameServiceName</strong> information of <strong id="ALM-14022__b1493216017318">Location</strong>.</span></li></ol>
<p id="ALM-14022__p4297101174848"><strong id="ALM-14022__b17979009174849">Check whether the threshold is too small.</strong></p>
<ol start="3" id="ALM-14022__ol50513287175520"><li id="ALM-14022__li5807247717318"><span>Check the status of the services that depend on HDFS. Check whether the services run slowly or task execution times out.</span><p><ul id="ALM-14022__ul5289024717318"><li id="ALM-14022__li625018217318">If yes, go to <a href="#ALM-14022__li6328681517318">8</a>.</li><li id="ALM-14022__li6008913317318">If no, go to <a href="#ALM-14022__li4999873217318">4</a>.</li></ul>
</p></li><li id="ALM-14022__li4999873217318"><a name="ALM-14022__li4999873217318"></a><a name="li4999873217318"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b0357102094314">Cluster > </strong><em id="ALM-14022__i11360520124318">Name of the desired cluster</em><strong id="ALM-14022__b17358420174317"> > Services</strong> > <strong id="ALM-14022__b2336552817318">HDFS</strong>. Click the drop-down menu in the upper right corner of <strong id="ALM-14022__b3273144141318">Chart</strong>, choose <strong id="ALM-14022__b7246166191312">Customize</strong> > <strong id="ALM-14022__b15702441192211">RPC</strong>, and select <strong id="ALM-14022__b5492738917318">Average Time of Active NameNode RPC Queuing</strong> and click <strong id="ALM-14022__b2458445917318">OK</strong>.</span></li><li id="ALM-14022__li4518419617318"><span>On the <strong id="ALM-14022__b400458517318">Average Time of Active NameNode RPC Queuing </strong>monitoring page, obtain the value of the NameService node involved in this alarm.</span></li><li id="ALM-14022__li3604126517318"><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b14700839164913">O&M > Alarm > Thresholds > </strong><em id="ALM-14022__i7702183916491">Name of the desired cluster</em> <strong id="ALM-14022__b15701143915492">></strong> <strong id="ALM-14022__b3841158527">HDFS</strong>. Locate <strong id="ALM-14022__b727811357574">Average Time of Active NameNode RPC Queuing</strong> and click the <strong id="ALM-14022__b97753479185">Modify</strong> in the <strong id="ALM-14022__b4440437317318">Operation</strong> column of the default rule. The <strong id="ALM-14022__b6409504417318">Modify Rule</strong> page is displayed. Change <strong id="ALM-14022__b3998448517318">Threshold</strong> to 150% of the monitored value. 
Click <strong id="ALM-14022__b2431605117318">OK</strong> to save the new threshold.</span></li><li id="ALM-14022__li2344310317318"><span>Wait for 1 minute and then check whether the alarm is automatically cleared.</span><p><ul id="ALM-14022__ul966133517318"><li id="ALM-14022__li1984315117318">If yes, no further action is required.</li><li id="ALM-14022__li4437063217318">If no, go to <a href="#ALM-14022__li6328681517318">8</a>.</li></ul>
<p id="ALM-14022__p6682745217318"><strong id="ALM-14022__b6457616417318">Check whether the CPU performance of the NameNode node is sufficient.</strong></p>
</p></li><li id="ALM-14022__li6328681517318"><a name="ALM-14022__li6328681517318"></a><a name="li6328681517318"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b1067214543198">O&M > Alarm > Alarms</strong> and check whether <strong id="ALM-14022__b2595840317318">ALM-12016 CPU Usage Exceeds the Threshold</strong> is generated.</span><p><ul id="ALM-14022__ul3229903517318"><li id="ALM-14022__li2225586217318">If yes, go to <a href="#ALM-14022__li922016517318">9</a>.</li><li id="ALM-14022__li5128787017318">If no, go to <a href="#ALM-14022__li3577444117318">11</a>.</li></ul>
</p></li><li id="ALM-14022__li922016517318"><a name="ALM-14022__li922016517318"></a><a name="li922016517318"></a><span>Handle <strong id="ALM-14022__b1587262717318">ALM-12016 CPU Usage Exceeds the Threshold</strong> by taking recommended actions.</span></li><li id="ALM-14022__li863591517318"><span>Wait for 10 minutes and check whether alarm 14022 is automatically cleared.</span><p><ul id="ALM-14022__ul1061437617318"><li id="ALM-14022__li2842052117318">If yes, no further action is required.</li><li id="ALM-14022__li5445810517318">If no, go to <a href="#ALM-14022__li3577444117318">11</a>.</li></ul>
</p></li></ol>
<p id="ALM-14022__p30751880175527"><strong id="ALM-14022__b17955966175528">Check whether the memory of the NameNode node is too small.</strong></p>
<ol start="11" id="ALM-14022__ol22946819175611"><li id="ALM-14022__li3577444117318"><a name="ALM-14022__li3577444117318"></a><a name="li3577444117318"></a><span>On the FusionInsight Manager portal, click <strong id="ALM-14022__b78961469202">O&M > Alarm > Alarms</strong> and check whether <strong id="ALM-14022__b1204863717318">ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold</strong> is generated.</span><p><ul id="ALM-14022__ul4132887317318"><li id="ALM-14022__li3641554117318">If yes, go to <a href="#ALM-14022__li5900064917318">12</a>.</li><li id="ALM-14022__li3892858817318">If no, go to <a href="#ALM-14022__li2539715217318">14</a>.</li></ul>
</p></li><li id="ALM-14022__li5900064917318"><a name="ALM-14022__li5900064917318"></a><a name="li5900064917318"></a><span>Handle <strong id="ALM-14022__b6124379417318">ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold</strong> by taking recommended actions.</span></li><li id="ALM-14022__li1432323517318"><span>Wait for 10 minutes and check whether alarm 14022 is automatically cleared.</span><p><ul id="ALM-14022__ul6180025317318"><li id="ALM-14022__li1933136917318">If yes, no further action is required.</li><li id="ALM-14022__li3976460117318">If no, go to <a href="#ALM-14022__li2539715217318">14</a>.</li></ul>
</p></li></ol>
<p id="ALM-14022__p61023602175617"><strong id="ALM-14022__b19259288175618">Check whether NameNode parameters are configured properly.</strong></p>
<ol start="14" id="ALM-14022__ol31426325175655"><li id="ALM-14022__li2539715217318"><a name="ALM-14022__li2539715217318"></a><a name="li2539715217318"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b7541928184420">Cluster > </strong><em id="ALM-14022__i657192854416">Name of the desired cluster</em><strong id="ALM-14022__b145417289447"> > Services</strong> > <strong id="ALM-14022__b2162943793121">HDFS</strong> > <strong id="ALM-14022__b6593994694053">Configurations</strong> > <strong id="ALM-14022__b715398193121">All</strong> <strong id="ALM-14022__b6816162115232">Configurations</strong>. Search for parameter <strong id="ALM-14022__b6651496417318">dfs.namenode.handler.count</strong> and view its value. If the value is less than or equal to 128, change it to <strong id="ALM-14022__b6176376817318">128</strong>. If the value is greater than 128 but less than 192, change it to <strong id="ALM-14022__b1900300317318">192</strong>.</span></li><li id="ALM-14022__li3680930217318"><span>Search for parameter <strong id="ALM-14022__b6284826617318">ipc.server.read.threadpool.size</strong> and view its value. If the value is less than 5, change it to <strong id="ALM-14022__b2876348917318">5</strong>.</span></li><li id="ALM-14022__li5754481017318"><span>Click <strong id="ALM-14022__b4814124817318">Save</strong>, and click <strong id="ALM-14022__b712700417318">OK</strong>.</span></li><li id="ALM-14022__li4041641217318"><span>On the <strong id="ALM-14022__b2820338917318">Instance</strong> page of HDFS, select the standby NameNode of NameService involved in this alarm and choose <strong id="ALM-14022__b5250391217318">More </strong>> <strong id="ALM-14022__b277316717318">Restart Instance</strong>. Enter the password and click <strong id="ALM-14022__b2495850517318">OK</strong>. 
Wait until the standby NameNode is started up.</span></li><li id="ALM-14022__li2329996117318"><span>On the <strong id="ALM-14022__b837306317318">Instance</strong> page of HDFS, select the active NameNode of NameService involved in this alarm and choose <strong id="ALM-14022__b824870817318">More</strong> > <strong id="ALM-14022__b712951317318">Restart Instance</strong>. Enter the password and click <strong id="ALM-14022__b6416562417318">OK</strong>. Wait until the active NameNode is started up.</span></li><li id="ALM-14022__li4061970417318"><span>Wait for 1 hour and then check whether the alarm is automatically cleared.</span><p><ul id="ALM-14022__ul3003301917318"><li id="ALM-14022__li186172317318">If yes, no further action is required.</li><li id="ALM-14022__li1675551217318">If no, go to <a href="#ALM-14022__li2529838417318">20</a>.</li></ul>
</p></li></ol>
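<p>The parameter adjustments in the steps above can be expressed as two small helper functions. This is a minimal sketch for checking values by hand: the function names are hypothetical, and the sketch assumes that values already at or above the stated limits are left unchanged, which the procedure does not state explicitly.</p>

```python
def recommended_handler_count(current: int) -> int:
    """Suggested dfs.namenode.handler.count according to the steps above."""
    if current <= 128:
        return 128
    if current < 192:
        return 192
    # Assumption: values of 192 or more are left as they are.
    return current


def recommended_read_threadpool_size(current: int) -> int:
    """Suggested ipc.server.read.threadpool.size according to the steps above."""
    # Values below 5 are raised to 5; larger values are kept.
    return max(current, 5)
```

<p>For example, a handler count of 150 would be raised to 192, while a read thread pool size of 8 would stay at 8.</p>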
<p id="ALM-14022__p58884337175658"><strong id="ALM-14022__b39017602175659">Check whether the HDFS workload changes and reduce the workload properly.</strong></p>
<ol start="20" id="ALM-14022__ol1375842417582"><li id="ALM-14022__li2529838417318"><a name="ALM-14022__li2529838417318"></a><a name="li2529838417318"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b1568412447446">Cluster > </strong><em id="ALM-14022__i1685174444416">Name of the desired cluster</em><strong id="ALM-14022__b17684114418443"> > Services</strong> > <strong id="ALM-14022__b3590323017318">HDFS</strong>. Click the drop-down menu in the upper right corner of <strong id="ALM-14022__b1685212012551">Chart</strong>, click <strong id="ALM-14022__b0852620165510">Customize</strong>, select <strong id="ALM-14022__b16305221817">Average Time of Active NameNode RPC Queuing</strong>, and click <strong id="ALM-14022__b898586117318">OK</strong>.</span></li><li id="ALM-14022__li1376388717318"><span>Click <span><img id="ALM-14022__image454164016262" src="en-us_image_0269417366.png"></span>. The <strong id="ALM-14022__b16827164118286">Details</strong> page is displayed.</span></li><li id="ALM-14022__li4113307817318"><span>Set the monitoring data display period to the range from 5 days before the alarm generation time to the alarm generation time. Click <strong id="ALM-14022__b55083111298">OK</strong>.</span></li><li id="ALM-14022__li5546113417318"><span>On the <strong id="ALM-14022__b2938816217318">Average RPC Queuing Time</strong> monitoring page, check whether there is a point in time at which the queuing time increases abruptly.</span><p><ul id="ALM-14022__ul1196892817318"><li id="ALM-14022__li4061149217318">If yes, go to <a href="#ALM-14022__li6583884617318">24</a>.</li><li id="ALM-14022__li1076881617318">If no, go to <a href="#ALM-14022__li4075154117318">27</a>.</li></ul>
</p></li><li id="ALM-14022__li6583884617318"><a name="ALM-14022__li6583884617318"></a><a name="li6583884617318"></a><span>Identify that point in time. Check whether a new task that frequently accesses HDFS started around then and whether its access frequency can be reduced.</span></li><li id="ALM-14022__li5567870617318"><span>If a Balancer task started at that point in time, stop the task or specify a node for the task to reduce the HDFS workload.</span></li><li id="ALM-14022__li3134630617318"><span>Wait for 1 hour and then check whether the alarm is automatically cleared.</span><p><ul id="ALM-14022__ul1368130517318"><li id="ALM-14022__li5602288717318">If yes, no further action is required.</li><li id="ALM-14022__li3444394117318">If no, go to <a href="#ALM-14022__li4075154117318">27</a>.</li></ul>
</p></li></ol>
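<p>If the monitoring data for the five-day period is available outside the portal, the check for an abrupt increase in the steps above could be approximated as follows. This is a rough sketch under stated assumptions: the function <code>first_abrupt_increase</code>, the <code>window</code> of six preceding samples, and the <code>factor</code> of 2.0 are all hypothetical choices for illustration, not values defined by the product.</p>

```python
from datetime import datetime
from statistics import mean
from typing import Optional, Sequence, Tuple

# One monitoring sample: (collection time, average RPC queuing time in ms).
Sample = Tuple[datetime, float]


def first_abrupt_increase(
    samples: Sequence[Sample],
    window: int = 6,      # hypothetical: number of preceding samples used as baseline
    factor: float = 2.0,  # hypothetical: how far above the baseline counts as abrupt
) -> Optional[datetime]:
    """Return the time of the first sample that jumps above the recent baseline."""
    for i in range(window, len(samples)):
        baseline = mean([value for _, value in samples[i - window:i]])
        timestamp, value = samples[i]
        if baseline > 0 and value > factor * baseline:
            return timestamp
    # No sample stands out from its preceding window.
    return None
```

<p>The returned time is only a starting point for the manual review of tasks that began accessing HDFS around then.</p>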
<p id="ALM-14022__p66717514175811"><strong id="ALM-14022__b40929879175812">Collect fault information.</strong></p>
<ol start="27" id="ALM-14022__ol10448549175815"><li id="ALM-14022__li4075154117318"><a name="ALM-14022__li4075154117318"></a><a name="li4075154117318"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14022__b39977366113627">O&M</strong> > <strong id="ALM-14022__b24251979113627">Log > Download</strong>.</span></li><li id="ALM-14022__li4575577517318"><span>Select <strong id="ALM-14022__b133271358102519">HDFS</strong> in the required cluster for <strong id="ALM-14022__b914879817318">Service</strong>.</span></li><li id="ALM-14022__li1145664103113"><span>Click <span><img id="ALM-14022__image1945644173117" src="en-us_image_0269417367.png"></span> in the upper right corner, and set <strong id="ALM-14022__b6456941173117">Start Date</strong> and <strong id="ALM-14022__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14022__b13456164113319">Download</strong>.</span></li><li id="ALM-14022__li1200838917318"><span>Contact <span id="ALM-14022__text4614151421417">O&M personnel</span> and send them the collected logs.</span></li></ol>
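<p>The log collection window from the steps above is simple date arithmetic. The following sketch computes the <strong>Start Date</strong> and <strong>End Date</strong> values from the alarm generation time; the function name <code>log_collection_window</code> is hypothetical.</p>

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(
    alarm_time: datetime, margin_minutes: int = 10
) -> Tuple[datetime, datetime]:
    """Return (start, end): 10 minutes before and after the alarm by default."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin
```

<p>For an alarm generated at 12:00, this yields a window from 11:50 to 12:10 on the same day.</p>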
</div>
<div class="section" id="ALM-14022__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14022__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-14022__section4096664417318"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14022__p2996385417318">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>