Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-19020"></a><a name="ALM-19020"></a>
<h1 class="topictitle1">ALM-19020 Number of HBase WAL Files to Be Synchronized Exceeds the Threshold</h1>
<div id="body0000001113292058"><div class="section" id="ALM-19020__section2812165915185"><h4 class="sectiontitle">Description</h4><p id="ALM-19020__p1231351418316">Every 30 seconds, the system checks the number of write-ahead log (WAL) files waiting to be synchronized to the standby cluster on each RegionServer of every HBase service instance. You can view this indicator on the RegionServer role monitoring page. This alarm is generated when the number of WAL files to be synchronized on a RegionServer exceeds the threshold. By default, the alarm is triggered when the number exceeds 128 in 20 consecutive checks. To change the threshold, choose <strong id="ALM-19020__b2731194806645">O&M</strong> > <strong id="ALM-19020__b25094796645">Alarm</strong> > <strong id="ALM-19020__b15066789956645">Threshold Configuration</strong> > <em id="ALM-19020__i7129436826645">Name of the desired cluster</em> > <strong id="ALM-19020__b223859976645">HBase</strong>. This alarm is cleared when the number of WAL files to be synchronized is less than or equal to the threshold.</p>
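<p>The trigger and clear rules above can be modeled as a small state machine. The following Python sketch is illustrative only and is not MRS code: the class name WalBacklogAlarm is hypothetical, and it assumes that a single sample at or below the threshold both resets the streak of consecutive breaches and clears the alarm. With the default values, 20 consecutive checks at 30-second intervals correspond to roughly 10 minutes of sustained backlog before the alarm is raised.</p>

```python
from dataclasses import dataclass


@dataclass
class WalBacklogAlarm:
    """Hypothetical model of the ALM-19020 trigger/clear rule (not MRS code)."""

    threshold: int = 128         # default threshold from the description
    required_breaches: int = 20  # consecutive checks above the threshold
    consecutive: int = 0
    active: bool = False

    def check(self, pending_wal_files: int) -> bool:
        """Feed one sample (taken every 30 seconds); return whether the alarm is active."""
        if pending_wal_files > self.threshold:
            self.consecutive += 1
            if self.consecutive >= self.required_breaches:
                self.active = True
        else:
            # Assumption: one sample at or below the threshold resets the
            # streak and clears the alarm.
            self.consecutive = 0
            self.active = False
        return self.active


alarm = WalBacklogAlarm()
states = [alarm.check(200) for _ in range(20)]
print(states[18], states[19])  # False True
print(alarm.check(128))        # False
```

<p>The 19th sample above the threshold does not raise the alarm; the 20th does. A sample equal to the threshold counts as "less than or equal to" and clears it.</p>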
</div>
<div class="section" id="ALM-19020__section25955920317"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-19020__table153018141532" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-19020__row1431461419318"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-19020__p1831414141837">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-19020__p1931416141437">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-19020__p1331431416317">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-19020__row531419147312"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-19020__p0314201416319">19020</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-19020__p83141314938">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-19020__p1631419141733">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-19020__section141612263418"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-19020__table74516141535" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-19020__row53145146319"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-19020__p53165149310">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-19020__p14316191411314">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-19020__row11533153471712"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19020__p13858113752316">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19020__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19020__row15316814937"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19020__p39123317">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19020__p83161014635">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19020__row193161314938"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19020__p37226997">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19020__p18316114535">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19020__row9316141415313"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19020__p66118565">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19020__p33168145315">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19020__row18226125716113"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19020__p26086497">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19020__p32631511">Specifies the threshold for triggering the alarm.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-19020__section16986944941"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-19020__p631619141939">If the number of WAL files waiting to be synchronized on a RegionServer exceeds the threshold, the number of ZooKeeper ZNodes used by HBase also exceeds its threshold, which affects the HBase service status.</p>
</div>
<div class="section" id="ALM-19020__section1855111591448"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-19020__ul1347121412510"><li id="ALM-19020__li781210818301">The network between the active and standby clusters is abnormal.</li><li id="ALM-19020__li1844332019318">Regions are unevenly distributed among the RegionServers of the active cluster.</li><li id="ALM-19020__li546202051815">The HBase service of the standby cluster is too small, for example, it has too few RegionServers.</li></ul>
</div>
<div class="section" id="ALM-19020__section59091196204"><h4 class="sectiontitle">Procedure</h4><p id="ALM-19020__p15316151413319">View alarm location information.</p>
<ol id="ALM-19020__ol136331058142014"><li id="ALM-19020__li250941518816"><span>Log in to FusionInsight Manager and choose <strong id="ALM-19020__b14147119105318">O&M</strong>. In the navigation pane on the left, choose <strong id="ALM-19020__b1415719965318">Alarm</strong> > <strong id="ALM-19020__b101588916539">Alarms</strong>. On the page that is displayed, locate the row containing the alarm whose <strong id="ALM-19020__b0158179175314">Alarm ID</strong> is <strong id="ALM-19020__b19158199535">19020</strong>, and view the service instance and host name in <strong id="ALM-19020__b91591912534">Location</strong>.</span></li></ol>
<p id="ALM-19020__p14370104016115">Check the network connection between the RegionServers of the active and standby clusters.</p>
<ol start="2" id="ALM-19020__ol1547114406113"><li id="ALM-19020__li11468174021113"><span>Run the <strong id="ALM-19020__b11468540191120">ping</strong> command to check whether the network connection between the faulty RegionServer node and the hosts where the RegionServers of the standby cluster reside is normal.</span><p><ul class="subitemlist" id="ALM-19020__ul1468640201116"><li id="ALM-19020__li1946884071113">If yes, go to <a href="#ALM-19020__li11347192371020">5</a>.</li><li class="subitemlist" id="ALM-19020__li20468194071112">If no, go to <a href="#ALM-19020__li1946854011118">3</a>.</li></ul>
</p></li><li id="ALM-19020__li1946854011118"><a name="ALM-19020__li1946854011118"></a><a name="li1946854011118"></a><span>Contact the network administrator to restore the network.</span></li><li id="ALM-19020__li8469640131111"><span>After the network recovers, check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-19020__ul2046894041118"><li id="ALM-19020__li446834019118">If yes, no further action is required.</li><li id="ALM-19020__li446818403118">If no, go to <a href="#ALM-19020__li11347192371020">5</a>.</li></ul>
</p></li></ol>
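<p>When the standby cluster has many RegionServer hosts, the check in steps 2 to 4 can be scripted. The following Python sketch is an illustration under stated assumptions, not an MRS tool: it runs the Linux iputils ping command (ping -c 1 -W 2) once per host, the helper names are hypothetical, and the host names you pass in must be the actual standby RegionServer hosts. The probe function is a parameter so that the branching logic can be tried without a network.</p>

```python
import subprocess
from typing import Callable, Dict, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request with Linux iputils ping (assumed installed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_standby_hosts(
    hosts: Iterable[str], probe: Callable[[str], bool] = ping_once
) -> Dict[str, bool]:
    """Map each standby RegionServer host to whether it answered the probe."""
    return {host: probe(host) for host in hosts}


def next_step(reachability: Dict[str, bool]) -> int:
    """Apply the branch in step 2: all hosts reachable -> step 5, otherwise step 3."""
    if not reachability:
        raise ValueError("no standby RegionServer hosts were checked")
    return 5 if all(reachability.values()) else 3
```

<p>For example, check_standby_hosts(["standby-rs-1", "standby-rs-2"]) pings two placeholder hosts, and next_step() on the result returns 5 only when every host answered.</p>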
<p id="ALM-19020__p67011652191116">Check the RegionServer region distribution in the active cluster.</p>
<ol start="5" id="ALM-19020__ol157798528117"><li id="ALM-19020__li11347192371020"><a name="ALM-19020__li11347192371020"></a><a name="li11347192371020"></a><span>On FusionInsight Manager, choose <strong id="ALM-19020__b1382454111911">Cluster</strong> > <em id="ALM-19020__i1683010415195">Name of the desired cluster</em> > <strong id="ALM-19020__b38308416196">Services</strong> > <strong id="ALM-19020__b188301241141915">HBase</strong>. Click <strong id="ALM-19020__b158301441171917">HMaster(Active)</strong> to go to the web UI of the HBase instance and check whether regions are evenly distributed among the RegionServers.</span></li><li id="ALM-19020__li277716529115"><a name="ALM-19020__li277716529115"></a><a name="li277716529115"></a><span>Log in to the faulty RegionServer node as user <strong id="ALM-19020__b177771652151114">omm</strong>.</span></li><li id="ALM-19020__li2777552141115"><span>Run the following commands to go to the client installation directory and set the environment variables:</span><p><p id="ALM-19020__p147778527112"><strong id="ALM-19020__b977711525117">cd </strong><em id="ALM-19020__i18777205214111">Client installation directory</em></p>
<p id="ALM-19020__p20777125211117"><strong id="ALM-19020__b16777205210110">source bigdata_env</strong></p>
<p id="ALM-19020__p477765218119">If the cluster is in security mode, perform security authentication: run the <strong id="ALM-19020__b043591532310">kinit hbase</strong> command and enter the password when prompted (obtain the password from the MRS cluster administrator).</p>
</p></li><li id="ALM-19020__li1677845221113"><span>Run the following commands to check whether the load balancing function is enabled:</span><p><p id="ALM-19020__p1077714527115"><strong id="ALM-19020__b157771552161115">hbase shell</strong></p>
<div class="p" id="ALM-19020__p1977835261117"><strong id="ALM-19020__b1077725221110">balancer_enabled</strong><ul id="ALM-19020__ul10778952191112"><li id="ALM-19020__li157781352121116">If yes, go to <a href="#ALM-19020__li127781952161113">10</a>.</li><li id="ALM-19020__li1778552161113">If no, go to <a href="#ALM-19020__li8778145241118">9</a>.</li></ul>
</div>
</p></li><li id="ALM-19020__li8778145241118"><a name="ALM-19020__li8778145241118"></a><a name="li8778145241118"></a><span>Run the following commands in HBase Shell to enable the load balancing function and verify that it is enabled:</span><p><p id="ALM-19020__p137782052141119"><strong id="ALM-19020__b97784522117">balance_switch true</strong></p>
<p id="ALM-19020__p107781152121116"><strong id="ALM-19020__b1977817526116">balancer_enabled</strong></p>
</p></li><li id="ALM-19020__li127781952161113"><a name="ALM-19020__li127781952161113"></a><a name="li127781952161113"></a><span>Run the <strong id="ALM-19020__b1899174619155">balancer</strong> command in HBase Shell to manually trigger load balancing.</span><p><div class="note" id="ALM-19020__note977865214116"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-19020__p157781152181112">You are advised to enable and manually trigger load balancing during off-peak hours.</p>
</div></div>
</p></li><li id="ALM-19020__li11778052191116"><span>Check whether the alarm is cleared.</span><p><ul id="ALM-19020__ul19778952111118"><li id="ALM-19020__li1778552151117">If yes, no further action is required.</li><li id="ALM-19020__li157781052111116">If no, go to <a href="#ALM-19020__li14354010126">12</a>.</li></ul>
</p></li></ol>
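<p>If steps 8 and 9 have to be repeated on several clusters, they can be wrapped in a script. The following Python sketch is an illustration built on assumptions, not an MRS tool: it assumes that the commands in step 7 (source bigdata_env and, in security mode, kinit hbase) have already been run in the same session, that hbase shell reads commands piped to its standard input and exits at end of input, and that balancer_enabled prints true or false on a line of its own. Verify the output format on your cluster before relying on the parser. The function names are hypothetical. The sketch deliberately does not run the balancer command from step 10, so that load balancing is still triggered manually during off-peak hours.</p>

```python
import subprocess
from typing import Callable, Optional


def run_hbase_shell(commands: str) -> str:
    """Pipe commands into `hbase shell` and return its standard output."""
    result = subprocess.run(
        ["hbase", "shell"],
        input=commands,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_balancer_state(output: str) -> Optional[bool]:
    """Return the first bare 'true'/'false' line in the output, or None.

    The output format is an assumption; check it on your own cluster.
    """
    for line in output.splitlines():
        token = line.strip()
        if token in ("true", "false"):
            return token == "true"
    return None


def ensure_balancer_enabled(run: Callable[[str], str] = run_hbase_shell) -> bool:
    """Step 8: check the balancer; step 9: enable it if needed and re-check."""
    state = parse_balancer_state(run("balancer_enabled\n"))
    if state is None:
        raise RuntimeError("could not parse the balancer_enabled output")
    if state:
        return True
    # balance_switch prints the previous state, so re-check in a separate call.
    run("balance_switch true\n")
    return parse_balancer_state(run("balancer_enabled\n")) is True
```

<p>Passing a different run function makes the step 8 and step 9 logic testable without a live HBase cluster.</p>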
<p id="ALM-19020__p032210116123">Check the HBase service scale of the standby cluster.</p>
<ol start="12" id="ALM-19020__ol63557115127"><li id="ALM-19020__li14354010126"><a name="ALM-19020__li14354010126"></a><a name="li14354010126"></a><span>Scale out the HBase service of the standby cluster: add a node and add a RegionServer instance on it. Then, perform <a href="#ALM-19020__li277716529115">6</a> to <a href="#ALM-19020__li127781952161113">10</a> to enable the load balancing function and manually trigger it.</span></li><li id="ALM-19020__li1735513118122"><span>On FusionInsight Manager, choose <strong id="ALM-19020__b13204121031811">Cluster</strong> > <em id="ALM-19020__i3215201013189">Name of the desired cluster</em> > <strong id="ALM-19020__b142171210161813">Services</strong> > <strong id="ALM-19020__b18219151031818">HBase</strong>. Click <strong id="ALM-19020__b152211410151816">HMaster(Active)</strong> to go to the web UI of the HBase instance, refresh the page, and check whether regions are evenly distributed.</span><p><ul id="ALM-19020__ul1235418119129"><li id="ALM-19020__li3354181101217">If yes, go to <a href="#ALM-19020__li435514181217">14</a>.</li><li id="ALM-19020__li113546131214">If no, go to <a href="#ALM-19020__li193977212510">15</a>.</li></ul>
</p></li><li id="ALM-19020__li435514181217"><a name="ALM-19020__li435514181217"></a><a name="li435514181217"></a><span>Check whether the alarm is cleared.</span><p><ul id="ALM-19020__ul1235515151214"><li id="ALM-19020__li163556131220">If yes, no further action is required.</li><li id="ALM-19020__li03556115121">If no, go to <a href="#ALM-19020__li193977212510">15</a>.</li></ul>
</p></li></ol>
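<p>Steps 5 and 13 ask you to judge on the HMaster web UI whether regions are evenly distributed. To make that judgment repeatable, you can copy the number of regions per RegionServer from the web UI and compare each value with the average. The following Python sketch is one possible heuristic, not the rule that the HBase load balancer applies: the 20% tolerance and the function name are arbitrary assumptions, and the RegionServer names and counts in the example are placeholders.</p>

```python
from statistics import mean
from typing import Dict, List


def uneven_servers(region_counts: Dict[str, int], tolerance: float = 0.2) -> List[str]:
    """Return RegionServers whose region count differs from the mean by more
    than `tolerance` (a fraction of the mean). The tolerance is an assumption,
    not an HBase or MRS setting."""
    if not region_counts:
        return []
    avg = mean(region_counts.values())
    low, high = avg * (1 - tolerance), avg * (1 + tolerance)
    return sorted(rs for rs, n in region_counts.items() if n < low or n > high)


# Placeholder region counts copied from the HMaster web UI.
counts = {"rs1": 28, "rs2": 34, "rs3": 28, "rs4": 10}
print(uneven_servers(counts))  # ['rs2', 'rs4']
```

<p>Here the mean is 25 regions, so with a 20% tolerance any RegionServer outside the range 20 to 30 is reported.</p>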
<p id="ALM-19020__p1438211216252"><strong id="ALM-19020__b1164110288315">Collect the fault information.</strong></p>
<ol start="15" id="ALM-19020__ol1239819219251"><li id="ALM-19020__li193977212510"><a name="ALM-19020__li193977212510"></a><a name="li193977212510"></a><span>On FusionInsight Manager of the standby cluster, choose <strong id="ALM-19020__b49821714142917">O&M</strong>. In the navigation pane on the left, choose <strong id="ALM-19020__b898412146298">Log</strong> > <strong id="ALM-19020__b1198591402914">Download</strong>.</span></li><li id="ALM-19020__li6397128257"><span>Expand the <strong id="ALM-19020__b63276662013">Service</strong> drop-down list and select <strong id="ALM-19020__b8327962202">HBase</strong> for the target cluster.</span></li><li id="ALM-19020__li16397122152519"><span>Click <span><img id="ALM-19020__image539772142518" src="en-us_image_0000001159847251.png"></span> in the upper right corner, and set <strong id="ALM-19020__b1353914704417">Start Date</strong> and <strong id="ALM-19020__b16539978442">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-19020__b85400754418">Download</strong>.</span></li><li id="ALM-19020__li1339817292514"><span>Contact <span id="ALM-19020__text16418127174317">O&M personnel</span> and provide the collected logs.</span></li></ol>
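<p>The collection window in step 17 is the alarm generation time plus or minus 10 minutes. The following Python sketch only computes the two values to enter as Start Date and End Date; the function name is hypothetical and the alarm time in the example is a placeholder.</p>

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime, margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (Start Date, End Date): margin_minutes before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


start, end = log_window(datetime(2024, 5, 1, 0, 5))
print(start, end)  # 2024-04-30 23:55:00 2024-05-01 00:15:00
```

<p>As the example shows, for an alarm generated shortly after midnight the Start Date falls on the previous day.</p>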
</div>
<div class="section" id="ALM-19020__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-19020__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-19020__section666819462133"><h4 class="sectiontitle">Related Information</h4><p id="ALM-19020__p1166844661311">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>