doc-exports/docs/mrs/umn/ALM-14010.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-14010"></a>
<h1 class="topictitle1">ALM-14010 NameService Service Is Abnormal</h1>
<div id="body16291030"><div class="section" id="ALM-14010__section37287599"><h4 class="sectiontitle">Description</h4><p id="ALM-14010__p44505092">The system checks the NameService service status every 180 seconds. This alarm is generated when the NameService service is unavailable.</p>
<p id="ALM-14010__p65001513">This alarm is cleared when the NameService service recovers.</p>
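<p>The generate-and-clear behavior described above can be pictured as a small state tracker. The following Python sketch is illustrative only: the names <strong>AlarmTracker</strong>, <strong>observe</strong>, and <strong>simulate</strong> are hypothetical and are not part of MRS or FusionInsight Manager. Only the 180-second interval comes from this topic, and the sketch assumes that a single failed check is enough to generate the alarm, which this topic does not state.</p>

```python
from typing import List, Optional, Tuple

CHECK_INTERVAL_SECONDS = 180  # check period stated in this topic


class AlarmTracker:
    """Hypothetical model of one alarm that is generated and cleared automatically."""

    def __init__(self) -> None:
        self.active = False  # True while the alarm is raised

    def observe(self, service_available: bool) -> Optional[str]:
        """Record one periodic check and return the resulting event, if any."""
        if not service_available and not self.active:
            self.active = True
            return "generated"
        if service_available and self.active:
            self.active = False
            return "cleared"
        return None  # no state change on this check


def simulate(checks: List[bool]) -> List[Tuple[int, str]]:
    """Replay a series of check results and list (elapsed seconds, event) pairs."""
    tracker = AlarmTracker()
    timeline = []
    for index, available in enumerate(checks):
        event = tracker.observe(available)
        if event is not None:
            timeline.append((index * CHECK_INTERVAL_SECONDS, event))
    return timeline


# Available, then unavailable for two checks, then recovered.
print(simulate([True, False, False, True]))
```

<p>In this model the alarm stays active across consecutive failed checks and produces one event per state change, which matches the "Auto Clear" attribute of this alarm.</p>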
</div>
<div class="section" id="ALM-14010__section44077"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14010__table30631213" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14010__row18244538"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14010__p1412591">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14010__p47311061">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14010__p6990739">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14010__row29379026"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14010__p30890938">14010</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14010__p19138034">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14010__p6676898">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14010__section396696"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14010__table3957875" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14010__row35526989"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14010__p59113859">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14010__p23493245">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14010__row149727275317"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14010__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14010__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14010__row23904675"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14010__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14010__p5163670">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14010__row46473036"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14010__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14010__p34025080">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14010__row37790264"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14010__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14010__p41783216">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14010__row40504628"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14010__p59649460">NameServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14010__p66876973">Specifies the NameService for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14010__section3570265"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14010__p48325702">HDFS cannot provide services to upper-layer components that depend on the affected NameService, such as HBase and MapReduce. As a result, users cannot read or write files.</p>
</div>
<div class="section" id="ALM-14010__section32132385"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14010__ul22067751"><li id="ALM-14010__li64392034">The KrbServer service is abnormal.</li><li id="ALM-14010__li42657401">The JournalNode is faulty.</li><li id="ALM-14010__li48372291">The DataNode is faulty.</li><li id="ALM-14010__li32697443">The disk capacity is insufficient.</li><li id="ALM-14010__li25841537">The NameNode enters safe mode.</li></ul>
</div>
<div class="section" id="ALM-14010__section20756009"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-14010__p12789763"><strong id="ALM-14010__b721516256428">Check the KrbServer service status.</strong></p>
<ol id="ALM-14010__ol299365291790"><li id="ALM-14010__li49048417852"><span>On FusionInsight Manager, choose <strong id="ALM-14010__b1120915339427">Cluster</strong> &gt; <em id="ALM-14010__i15198367426">Name of the desired cluster</em> &gt; <strong id="ALM-14010__b17339848134219">Services</strong>.</span></li><li id="ALM-14010__li1264678117852"><span>Check whether the KrbServer service exists.</span><p><ul class="subitemlist" id="ALM-14010__ul3868790017852"><li id="ALM-14010__li3972922917852">If yes, go to <a href="#ALM-14010__li4671216717852">3</a>.</li><li id="ALM-14010__li6395097917852">If no, go to <a href="#ALM-14010__li2979505817852">6</a>.</li></ul>
</p></li><li id="ALM-14010__li4671216717852"><a name="ALM-14010__li4671216717852"></a><a name="li4671216717852"></a><span>Click <strong id="ALM-14010__b164085816425">KrbServer</strong>.</span></li><li id="ALM-14010__li5139002817852"><span>Click <strong id="ALM-14010__b1499907615113152">Instances</strong>. On the KrbServer management page, select the faulty instance, and choose <strong id="ALM-14010__b7930192215503">More</strong> &gt; <strong id="ALM-14010__b9931182210509">Restart Instance</strong>. Check whether the instance successfully restarts.</span><p><ul class="subitemlist" id="ALM-14010__ul6536232617852"><li id="ALM-14010__li2558918717852">If yes, go to <a href="#ALM-14010__li1076710217852">5</a>.</li><li id="ALM-14010__li5945826317852">If no, go to <a href="#ALM-14010__li5097747017852">24</a>.</li></ul>
</p></li><li id="ALM-14010__li1076710217852"><a name="ALM-14010__li1076710217852"></a><a name="li1076710217852"></a><span>Choose <strong id="ALM-14010__b1799465720113152">O&amp;M</strong> &gt; <strong id="ALM-14010__b986335617113152">Alarm </strong>&gt; <strong id="ALM-14010__b190167493113152">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14010__ul1504600817852"><li id="ALM-14010__li5985707417852">If yes, no further action is required.</li><li id="ALM-14010__li1658485917852">If no, go to <a href="#ALM-14010__li2979505817852">6</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14010__p119634417852"><strong id="ALM-14010__b121016764420">Check the JournalNode instance status.</strong></p>
<ol start="6" id="ALM-14010__ol5440624717917"><li id="ALM-14010__li2979505817852"><a name="ALM-14010__li2979505817852"></a><a name="li2979505817852"></a><span>On FusionInsight Manager, choose <strong id="ALM-14010__b789155216457">Cluster</strong> &gt; <em id="ALM-14010__i19535213452">Name of the desired cluster</em> &gt; <strong id="ALM-14010__b1595125294518">Services</strong>.</span></li><li id="ALM-14010__li6682893317852"><span>Choose <strong id="ALM-14010__b115441121514">HDFS</strong> &gt; <strong id="ALM-14010__b088151485112">Instances</strong>.</span></li><li id="ALM-14010__li5663707817852"><span>Check whether the <strong id="ALM-14010__b39712183515">Running Status</strong> of the JournalNode is <strong id="ALM-14010__b1697881812511">Normal</strong>.</span><p><ul class="subitemlist" id="ALM-14010__ul1374954917852"><li id="ALM-14010__li6436604317852">If yes, go to <a href="#ALM-14010__li1229459717852">11</a>.</li><li id="ALM-14010__li4626697017852">If no, go to <a href="#ALM-14010__li34233917852">9</a>.</li></ul>
</p></li><li id="ALM-14010__li34233917852"><a name="ALM-14010__li34233917852"></a><a name="li34233917852"></a><span>Select the faulty JournalNode, and choose <strong id="ALM-14010__b2735183616461">More</strong> &gt; <strong id="ALM-14010__b17741143620463">Restart Instance</strong>. Check whether the JournalNode successfully restarts.</span><p><ul class="subitemlist" id="ALM-14010__ul5969036117852"><li id="ALM-14010__li2420056617852">If yes, go to <a href="#ALM-14010__li136606617852">10</a>.</li><li id="ALM-14010__li1408880217852">If no, go to <a href="#ALM-14010__li5097747017852">24</a>.</li></ul>
</p></li><li id="ALM-14010__li136606617852"><a name="ALM-14010__li136606617852"></a><a name="li136606617852"></a><span>Choose <strong id="ALM-14010__b2069838015113152">O&amp;M</strong> &gt; <strong id="ALM-14010__b244120423113152">Alarm </strong>&gt; <strong id="ALM-14010__b141811068113152">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14010__ul3150003517852"><li id="ALM-14010__li308105917852">If yes, no further action is required.</li><li id="ALM-14010__li4823924617852">If no, go to <a href="#ALM-14010__li1229459717852">11</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14010__p1506486617852"><strong id="ALM-14010__b7568161924419">Check the DataNode instance status.</strong></p>
<ol start="11" id="ALM-14010__ol4671550617933"><li id="ALM-14010__li1229459717852"><a name="ALM-14010__li1229459717852"></a><a name="li1229459717852"></a><span>On FusionInsight Manager, choose <strong id="ALM-14010__b964234219465">Cluster</strong> &gt; <em id="ALM-14010__i196481042144610">Name of the desired cluster</em> &gt; <strong id="ALM-14010__b19648842174610">Services</strong> &gt; <strong id="ALM-14010__b848914557469">HDFS</strong>.</span></li><li id="ALM-14010__li193569017852"><span>Click <strong id="ALM-14010__b74238191260">Instances</strong> and check whether <strong id="ALM-14010__b46788221868">Running Status</strong> of all DataNodes is <strong id="ALM-14010__b1695013718615">Normal</strong>.</span><p><ul class="subitemlist" id="ALM-14010__ul21507617852"><li id="ALM-14010__li3728299717852">If yes, go to <a href="#ALM-14010__li6155970417852">15</a>.</li><li id="ALM-14010__li2389717852">If no, go to <a href="#ALM-14010__li6039615117852">13</a>.</li></ul>
</p></li><li id="ALM-14010__li6039615117852"><a name="ALM-14010__li6039615117852"></a><a name="li6039615117852"></a><span>Click <strong id="ALM-14010__b1809656212113152">Instances</strong>. On the DataNode management page, select the faulty instance, and choose <strong id="ALM-14010__b1342512562453">More</strong> &gt; <strong id="ALM-14010__b1843118563454">Restart Instance</strong>. Check whether the DataNode successfully restarts.</span><p><ul class="subitemlist" id="ALM-14010__ul1416722317852"><li id="ALM-14010__li2257316717852">If yes, go to <a href="#ALM-14010__li2920958817852">14</a>.</li><li id="ALM-14010__li1648721617852">If no, go to <a href="#ALM-14010__li6155970417852">15</a>.</li></ul>
</p></li><li id="ALM-14010__li2920958817852"><a name="ALM-14010__li2920958817852"></a><a name="li2920958817852"></a><span>Choose <strong id="ALM-14010__b2055546434113152">O&amp;M</strong> &gt; <strong id="ALM-14010__b1330435913113152">Alarm </strong>&gt; <strong id="ALM-14010__b2137922568113152">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14010__ul4841387217852"><li id="ALM-14010__li669444717852">If yes, no further action is required.</li><li id="ALM-14010__li537931917852">If no, go to <a href="#ALM-14010__li6155970417852">15</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14010__p3307167117852"><strong id="ALM-14010__b4122670017939">Check disk status.</strong></p>
<ol start="15" id="ALM-14010__ol4905050317952"><li id="ALM-14010__li6155970417852"><a name="ALM-14010__li6155970417852"></a><a name="li6155970417852"></a><span>On FusionInsight Manager, choose <strong id="ALM-14010__b29211641366">Cluster</strong> &gt; <em id="ALM-14010__i10921124114614">Name of the desired cluster</em> &gt; <strong id="ALM-14010__b19922124115610">Host</strong>.</span></li><li id="ALM-14010__li4816398217852"><span>In the <strong id="ALM-14010__b3776831132018">Disk</strong> column, check whether the disk space is insufficient.</span><p><ul class="subitemlist" id="ALM-14010__ul2026463417852"><li id="ALM-14010__li2028012917852">If yes, go to <a href="#ALM-14010__li3082265617852">17</a>.</li><li id="ALM-14010__li3207778717852">If no, go to <a href="#ALM-14010__li6295063617852">19</a>.</li></ul>
</p></li><li id="ALM-14010__li3082265617852"><a name="ALM-14010__li3082265617852"></a><a name="li3082265617852"></a><span>Expand the disk capacity. </span></li><li id="ALM-14010__li2190759617852"><span>Choose <strong id="ALM-14010__b847685918113152">O&amp;M</strong> &gt; <strong id="ALM-14010__b652389345113152">Alarm </strong>&gt; <strong id="ALM-14010__b1726937408113152">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14010__ul2843961617852"><li id="ALM-14010__li896844917852">If yes, no further action is required.</li><li id="ALM-14010__li5535574017852">If no, go to <a href="#ALM-14010__li6295063617852">19</a>.</li></ul>
</p></li></ol>
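<p>The disk check above is done in the <strong>Disk</strong> column on FusionInsight Manager. As a rough sketch, the same kind of check can be scripted on a host with the Python standard library. The 90 percent threshold and the function names below are assumptions made for illustration; this topic does not define what counts as insufficient, and the directories that matter depend on your cluster configuration.</p>

```python
import shutil

# Hypothetical threshold; this topic does not define "insufficient".
THRESHOLD_PERCENT = 90.0


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def is_space_insufficient(total_bytes: int, used_bytes: int,
                          threshold_percent: float = THRESHOLD_PERCENT) -> bool:
    """Return True when usage has reached the threshold."""
    return used_percent(total_bytes, used_bytes) >= threshold_percent


def check_path(path: str) -> bool:
    """Check the file system that contains path, using shutil.disk_usage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_space_insufficient(usage.total, usage.used)


print(is_space_insufficient(500, 475))  # 95 percent used
print(is_space_insufficient(500, 100))  # 20 percent used
```

<p>To check a real mount point, call <strong>check_path</strong> with a directory on that disk. The value shown on FusionInsight Manager remains the reference for this procedure.</p>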
<p class="tableheading" id="ALM-14010__p5462996017852"><strong id="ALM-14010__b5513124410402">Check whether the NameNode is in safe mode.</strong></p>
<ol start="19" id="ALM-14010__ol27876677171013"><li id="ALM-14010__li6295063617852"><a name="ALM-14010__li6295063617852"></a><a name="li6295063617852"></a><span>On FusionInsight Manager, choose <strong id="ALM-14010__b1064623144720">Cluster</strong> &gt; <em id="ALM-14010__i15759112612473">Name of the desired cluster</em> &gt; <strong id="ALM-14010__b1340993304715">Services</strong> &gt; <strong id="ALM-14010__b2279123414478">HDFS</strong>. Click <strong id="ALM-14010__b7335123714712">NameNode(Active)</strong> of the abnormal NameService. The NameNode web UI is displayed.</span><p><div class="note" id="ALM-14010__note184603141102"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14010__en-us_topic_0193189480_en-us_topic_0070539288_p3460314151015">By default, the admin user does not have the management rights of other components. If the page cannot be opened or the content is not completely displayed due to insufficient permission when you access the native page of a component, you can manually create a user with the management rights of the corresponding component to log in to the component.</p>
</div></div>
</p></li><li id="ALM-14010__li4698013317852"><span>On the NameNode web UI, check whether "Safe mode is ON." is displayed.</span><p><p class="litext" id="ALM-14010__p2968481917852">The information that follows <strong id="ALM-14010__b1846656379113152">Safe mode is ON</strong> is alarm information and varies with the actual conditions.</p>
<ul class="subitemlist" id="ALM-14010__ul4250271717852"><li id="ALM-14010__li5566015917852">If yes, go to <a href="#ALM-14010__li5459096817852">21</a>.</li><li id="ALM-14010__li1217906417852">If no, go to <a href="#ALM-14010__li5097747017852">24</a>.</li></ul>
</p></li><li id="ALM-14010__li5459096817852"><a name="ALM-14010__li5459096817852"></a><a name="li5459096817852"></a><span>Log in to the client as user <strong id="ALM-14010__b5272194112443">root</strong>. <span id="ALM-14010__text85258205227"></span> Run the <strong id="ALM-14010__b791421174517">cd</strong> command to go to the client installation directory, and then run the <strong id="ALM-14010__b253682474513">source bigdata_env</strong> command. If the cluster is in security mode, perform security authentication: run the <strong id="ALM-14010__b17911210154916">kinit hdfs</strong> command and enter the password as prompted. The password can be obtained from the MRS cluster administrator. If the cluster is in non-security mode, log in as user <strong id="ALM-14010__b8214652174919">omm</strong> and run the command. Ensure that user <strong id="ALM-14010__b1368713110508">omm</strong> has permission to run client commands.</span></li><li id="ALM-14010__li5979226317852"><span>Run the <strong id="ALM-14010__b1569542817452">hdfs dfsadmin -safemode leave</strong> command to make the NameNode leave safe mode.</span></li><li id="ALM-14010__li1312070317852"><span>Choose <strong id="ALM-14010__b759788459113152">O&amp;M</strong> &gt; <strong id="ALM-14010__b1113377718113152">Alarm </strong>&gt; <strong id="ALM-14010__b1115504115113152">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14010__ul4572973117852"><li id="ALM-14010__li125945817852">If yes, no further action is required.</li><li id="ALM-14010__li3490724317852">If no, go to <a href="#ALM-14010__li5097747017852">24</a>.</li></ul>
</p></li></ol>
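<p>The steps above read the safe mode state from the NameNode web UI and exit safe mode with <strong>hdfs dfsadmin -safemode leave</strong>. As a hedged sketch, the state can also be read on the client with <strong>hdfs dfsadmin -safemode get</strong>, a standard HDFS dfsadmin subcommand, and filtered in Python. The function names, host names, and addresses below are hypothetical, and the exact output text may differ between HDFS versions.</p>

```python
import subprocess


def read_safe_mode_status() -> str:
    """Run `hdfs dfsadmin -safemode get` and return its text output.

    Assumes the client environment is already loaded (source bigdata_env)
    and, in security mode, that kinit has been completed.
    """
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def lines_reporting_safe_mode_on(status_text: str) -> list:
    """Return the output lines that report safe mode as ON."""
    lines = [line.strip() for line in status_text.splitlines()]
    return [line for line in lines if line.startswith("Safe mode is ON")]


# Illustrative output for an HA pair; hosts and addresses are made up.
sample = (
    "Safe mode is ON in nn1.example.com/192.0.2.10:8020\n"
    "Safe mode is OFF in nn2.example.com/192.0.2.11:8020\n"
)
print(lines_reporting_safe_mode_on(sample))
```

<p>On a client node, pass the result of <strong>read_safe_mode_status()</strong> to <strong>lines_reporting_safe_mode_on</strong>. An empty list after the <strong>leave</strong> command suggests that no NameNode still reports safe mode.</p>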
<p class="tableheading" id="ALM-14010__p891439617852"><strong id="ALM-14010__b16581546619">Collect the fault information.</strong></p>
<ol start="24" id="ALM-14010__ol10367908171020"><li id="ALM-14010__li5097747017852"><a name="ALM-14010__li5097747017852"></a><a name="li5097747017852"></a><span>On FusionInsight Manager, choose <strong id="ALM-14010__b11041054202717">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-14010__b2115175402717">Log</strong> &gt; <strong id="ALM-14010__b10116105413278">Download</strong>.</span></li><li id="ALM-14010__li5971352317852"><span>In the <strong id="ALM-14010__b634814782113152">Service </strong>area, select the following services of the desired cluster:</span><p><ul class="subitemlist" id="ALM-14010__ul5137407817852"><li id="ALM-14010__li5614404817852">ZooKeeper</li><li id="ALM-14010__li3553439217852">HDFS</li></ul>
</p></li><li id="ALM-14010__li55080017852"><span>Click <span><img id="ALM-14010__image104601319175315" src="en-us_image_0263895680.png"></span> in the upper right corner, and set <strong id="ALM-14010__b1663834666113152">Start Date</strong> and <strong id="ALM-14010__b1251116690113152">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14010__b1498166116113152">Download</strong>.</span></li><li id="ALM-14010__li4461486317852"><span>Contact <span id="ALM-14010__text1855510394424">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
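<p>The log collection window in the steps above is simple date arithmetic. The following Python sketch shows one way to compute the start and end values from the alarm generation time. The function name <strong>log_window</strong> and the sample timestamps are hypothetical; only the 10-minute margin comes from this procedure.</p>

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime,
               margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) covering margin_minutes on each side of alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_window(datetime(2024, 1, 15, 9, 30, 0))
print(start.isoformat(sep=" "), "to", end.isoformat(sep=" "))

# An alarm shortly after midnight gives a start time on the previous day.
early_start, early_end = log_window(datetime(2024, 1, 15, 0, 5, 0))
print(early_start.isoformat(sep=" "), "to", early_end.isoformat(sep=" "))
```

<p>Using <strong>timedelta</strong> instead of editing the minute field by hand keeps the window correct when it crosses an hour or day boundary.</p>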
</div>
<div class="section" id="ALM-14010__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14010__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-14010__section52586354"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14010__p838162">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>