doc-exports/docs/mrs/umn/ALM-12180.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-12180"></a>
<h1 class="topictitle1">ALM-12180 Suspended Disk I/O</h1>
<div id="body0000001353935630"><div class="section" id="ALM-12180__section14673296256"><h4 class="sectiontitle">Description</h4><ul id="ALM-12180__ul2014714514423"><li id="ALM-12180__li1614775174215">For HDDs, the alarm is triggered when any of the following conditions is met:<ul id="ALM-12180__ul17332858184213"><li id="ALM-12180__li3477654124215">The system collects data every 3 seconds, and detects that the <strong id="ALM-12180__b187823011515">svctm</strong> value exceeds 6s for 10 consecutive periods within 30 seconds.</li><li id="ALM-12180__li44771354124218">The system collects data every 3 seconds, and detects that the <strong id="ALM-12180__b1552492416563">avgqu-sz</strong> value is greater than 0, the IOPS or bandwidth is 0, and the <strong id="ALM-12180__b165881215175717">ioutil</strong> value is greater than <strong id="ALM-12180__b39725306574">99%</strong> for 10 consecutive periods within 30 seconds.</li></ul>
</li><li id="ALM-12180__li1447755464219">For SSDs, the alarm is triggered when any of the following conditions is met:<ul id="ALM-12180__ul1255610134312"><li id="ALM-12180__li15477195414210">The system collects data every 3 seconds, and detects that the <strong id="ALM-12180__b1360517433619">svctm</strong> value exceeds 3s for 10 consecutive periods within 30 seconds.</li><li id="ALM-12180__li947865419422">The system collects data every 3 seconds, and detects that the <strong id="ALM-12180__b134091219815">avgqu-sz</strong> value is greater than 0, the IOPS or bandwidth is 0, and the <strong id="ALM-12180__b144071210812">ioutil</strong> value is greater than <strong id="ALM-12180__b1141612088">99%</strong> for 10 consecutive periods within 30 seconds.</li></ul>
</li></ul>
<p id="ALM-12180__p4178195414013">This alarm is automatically cleared when none of the preceding conditions has been met for 90 seconds.</p>
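<p>The trigger rule above can be sketched in Python as follows. This is only an illustration of the stated conditions, not the monitoring agent's actual code: the names <strong>Sample</strong>, <strong>is_slow</strong>, <strong>is_stalled</strong>, and <strong>should_alarm</strong> are hypothetical, each sample is assumed to represent one 3-second collection period, and <strong>svctm</strong> is assumed to be expressed in seconds. Each condition is counted separately, so a run of 10 periods must satisfy the same condition.</p>

```python
from dataclasses import dataclass

# Thresholds from the description above; the identifiers are illustrative.
SVCTM_LIMIT_S = {"HDD": 6.0, "SSD": 3.0}
PERIODS_REQUIRED = 10  # 10 consecutive 3-second periods (30 seconds)


@dataclass
class Sample:
    """One 3-second collection period for a single disk (hypothetical type)."""
    svctm_s: float     # service time, assumed to be in seconds
    avgqu_sz: float    # disk queue depth
    iops: float        # r/s + w/s
    bandwidth: float   # rkB/s + wkB/s
    ioutil_pct: float  # %util


def is_slow(s: Sample, disk_type: str) -> bool:
    """Condition 1: svctm exceeds the limit for the disk type."""
    return s.svctm_s > SVCTM_LIMIT_S[disk_type]


def is_stalled(s: Sample) -> bool:
    """Condition 2: requests are queued, nothing completes, disk looks busy."""
    return s.avgqu_sz > 0 and (s.iops == 0 or s.bandwidth == 0) and s.ioutil_pct > 99


def should_alarm(samples: list[Sample], disk_type: str) -> bool:
    """Return True once either condition holds for 10 consecutive periods."""
    slow_run = 0
    stalled_run = 0
    for s in samples:
        slow_run = slow_run + 1 if is_slow(s, disk_type) else 0
        stalled_run = stalled_run + 1 if is_stalled(s) else 0
        if slow_run >= PERIODS_REQUIRED or stalled_run >= PERIODS_REQUIRED:
            return True
    return False
```

<p>For example, ten consecutive samples with a <strong>svctm</strong> of 4 seconds satisfy the SSD condition but not the HDD condition, and a single normal sample in the middle of a run resets the count.</p>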
<div class="note" id="ALM-12180__note475192654512"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="ALM-12180__ul141039011819"><li id="ALM-12180__li110313041817">Run the following command in the OS to collect data:<p id="ALM-12180__p656484785012"><a name="ALM-12180__li110313041817"></a><a name="li110313041817"></a><strong id="ALM-12180__b15753163944513">iostat -x -t 1 1</strong></p>
<p id="ALM-12180__p20617203714515"><span><img id="ALM-12180__image10873168181011" src="en-us_image_0000001375901064.png"></span></p>
<p id="ALM-12180__p519784644617">Parameters are as follows:</p>
<ul id="ALM-12180__ul128810545469"><li id="ALM-12180__li8882175444617"><strong id="ALM-12180__b3886103261414">avgqu-sz</strong> indicates the disk queue depth.</li><li id="ALM-12180__li10882115414612">The sum of <strong id="ALM-12180__b783337151">r/s</strong> and <strong id="ALM-12180__b1069917412158">w/s</strong> is the IOPS.</li><li id="ALM-12180__li688245444615">The sum of <strong id="ALM-12180__b15363521121516">rkB/s</strong> and <strong id="ALM-12180__b5718623131513">wkB/s</strong> is the bandwidth.</li><li id="ALM-12180__li4882175404617"><strong id="ALM-12180__b149598261610">%util</strong> is the <strong id="ALM-12180__b10567142391617">ioutil</strong> value.</li></ul>
</li><li id="ALM-12180__li1458717157180">The formula for calculating <strong id="ALM-12180__b19511115375210">svctm</strong> is as follows:<p id="ALM-12180__p332417335118">svctm = (tot_ticks_new - tot_ticks_old) / (rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old)</p>
<p id="ALM-12180__p4167121643616">If <strong id="ALM-12180__b169718121611">rd_ios_new + wr_ios_new - rd_ios_old - wr_ios_old</strong> is <strong id="ALM-12180__b8972101216616">0</strong>, then <strong id="ALM-12180__b8972712666">svctm</strong> is <strong id="ALM-12180__b15972181217617">0</strong>.</p>
<p id="ALM-12180__p1268752201517">The parameters can be obtained as follows:</p>
<p id="ALM-12180__p5648122416463">The system runs the <strong id="ALM-12180__b3987117155318">cat /proc/diskstats</strong> command every 3 seconds to collect data. For example:</p>
<p id="ALM-12180__p1657515122539"><span><img id="ALM-12180__image1675110291273" src="en-us_image_0000001426500589.png"></span></p>
<p id="ALM-12180__p146243408539">In the two data collections shown above:</p>
<p id="ALM-12180__p1264621195310">In the data collected for the first time, the number in the fourth column is the <strong id="ALM-12180__b2974115335316">rd_ios_old</strong> value, the number in the eighth column is the <strong id="ALM-12180__b1197410533532">wr_ios_old</strong> value, and the number in the thirteenth column is the <strong id="ALM-12180__b0974125312533">tot_ticks_old</strong> value.</p>
<p id="ALM-12180__p415119825410">In the data collected for the second time, the number in the fourth column is the <strong id="ALM-12180__b5467171016545">rd_ios_new</strong> value, the number in the eighth column is the <strong id="ALM-12180__b746711016548">wr_ios_new</strong> value, and the number in the thirteenth column is the <strong id="ALM-12180__b446716104542">tot_ticks_new</strong> value.</p>
<p id="ALM-12180__p1328974985416">In this case, the value of <strong id="ALM-12180__b71451317175415">svctm</strong> is as follows:</p>
<p id="ALM-12180__p296819576542">(19571460 - 19569526) / (1101553 + 28747977 - 1101553 - 28744856) = 0.6197</p>
</li></ul>
</div></div>
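<p>The <strong>svctm</strong> calculation in the note above can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions: the function names <strong>parse_diskstats_line</strong> and <strong>svctm</strong> are hypothetical, and the input values are the ones from the worked example. The column positions (fourth, eighth, and thirteenth) follow the description above.</p>

```python
def parse_diskstats_line(line: str) -> tuple[int, int, int]:
    """Extract (rd_ios, wr_ios, tot_ticks) from one /proc/diskstats line.

    Columns are counted from 1 in the text above, so the fourth, eighth,
    and thirteenth columns are list indexes 3, 7, and 12.
    """
    fields = line.split()
    return int(fields[3]), int(fields[7]), int(fields[12])


def svctm(old: tuple[int, int, int], new: tuple[int, int, int]) -> float:
    """Apply the formula above to two consecutive collections.

    The kernel documents the tot_ticks counter in milliseconds; how the
    alarm logic converts units is not stated in this document.
    """
    rd_old, wr_old, ticks_old = old
    rd_new, wr_new, ticks_new = new
    completed = rd_new + wr_new - rd_old - wr_old
    if completed == 0:
        return 0.0  # no I/O completed in the period, so svctm is defined as 0
    return (ticks_new - ticks_old) / completed


# Values from the worked example: (rd_ios, wr_ios, tot_ticks)
first = (1101553, 28744856, 19569526)
second = (1101553, 28747977, 19571460)
print(round(svctm(first, second), 4))  # 0.6197
```

<p>When no read or write completes between the two collections, the function returns 0 rather than dividing by zero, which matches the rule stated in the note.</p>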
</div>
<div class="section" id="ALM-12180__section28308296"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-12180__table36969235" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-12180__row42433012"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-12180__p14521914">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-12180__p35424385">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-12180__p50802928">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-12180__row21396528"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-12180__p155960316331">12180</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-12180__p359323113318">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-12180__p175874319336">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-12180__section53448080"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-12180__table33617909" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-12180__row23730911"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-12180__p43155662">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-12180__p5947729">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-12180__row96067296346"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12180__p152081234181415">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12180__p16208734181415">Specifies the cluster or system for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12180__row28589139"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12180__p182081134141417">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12180__p16208103411414">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12180__row7926750304"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12180__p182089346142">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12180__p9208203419143">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12180__row1437219114715"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12180__p1643841912479">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12180__p243891964715">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12180__row14438219104712"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12180__p20438101994719">DiskName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12180__p184381019144712">Specifies the disk for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-12180__section14442155121012"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-12180__p815335724110">Continuously high disk I/O usage may adversely affect service operations and result in service loss.</p>
</div>
<div class="section" id="ALM-12180__section18133852349"><h4 class="sectiontitle">Possible Causes</h4><p id="ALM-12180__p1912711182110">The disk is aging.</p>
</div>
<div class="section" id="ALM-12180__section1262841419113"><h4 class="sectiontitle">Procedure</h4><p id="ALM-12180__p1160151292144"><strong id="ALM-12180__b154472341219">Replace the disk.</strong></p>
<ol id="ALM-12180__ol15090529811"><li id="ALM-12180__li1450914529813"><span>Log in to FusionInsight Manager and choose <strong id="ALM-12180__b94581437182117">O&amp;M</strong> &gt; <strong id="ALM-12180__b1245903711215">Alarm</strong> &gt; <strong id="ALM-12180__b2459163732117">Alarms</strong>.</span></li><li id="ALM-12180__li3509952087"><span>View the detailed information about the alarm. Check the values of <strong id="ALM-12180__b0301045202113">HostName</strong> and <strong id="ALM-12180__b731194516213">DiskName</strong> in the location information to obtain the information about the faulty disk for which the alarm is reported.</span></li><li id="ALM-12180__li135093521587"><span>Replace the hard disk.</span></li><li id="ALM-12180__li850919521818"><span>Check whether the alarm is cleared.</span><p><ul id="ALM-12180__ul1850985214816"><li id="ALM-12180__li135098521185">If yes, no further action is required.</li><li id="ALM-12180__li55095521088">If no, go to <a href="#ALM-12180__li1050815217817">5</a>.</li></ul>
</p></li></ol>
<p id="ALM-12180__p98841749221"><strong id="ALM-12180__b10230111620293">Collect fault information.</strong></p>
<ol start="5" id="ALM-12180__ol7509052983"><li id="ALM-12180__li1050815217817"><a name="ALM-12180__li1050815217817"></a><a name="li1050815217817"></a><span>On FusionInsight Manager, choose <strong id="ALM-12180__b7784618132913">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-12180__b167847184294">Log</strong> &gt; <strong id="ALM-12180__b978461810291">Download</strong>.</span></li><li id="ALM-12180__li135081527810"><span>Select <strong id="ALM-12180__b184031528182919">OMS</strong> for <strong id="ALM-12180__b1540372818299">Service</strong> and click <strong id="ALM-12180__b10404192818297">OK</strong>.</span></li><li id="ALM-12180__li95084527819"><span>Click <span><img id="ALM-12180__image050820521384" src="en-us_image_0000001405224197.png"></span> in the upper right corner, and set <strong id="ALM-12180__b1436133532917">Start Date</strong> and <strong id="ALM-12180__b153683517292">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-12180__b13711352294">Download</strong>.</span></li><li id="ALM-12180__li950915529812"><span>Contact <span id="ALM-12180__text65081352585">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
</div>
<div class="section" id="ALM-12180__section7293173912175"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-12180__p43102834211">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-12180__section8293639131715"><h4 class="sectiontitle">Related Information</h4><p id="ALM-12180__p13293103921715">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>