Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-45643"></a>
<h1 class="topictitle1">ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold</h1>
<div id="body0000002008221041"><p id="ALM-45643__p12261122253615">This section applies to MRS 3.3.0 or later.</p>
<div class="section" id="ALM-45643__section663215"><h4 class="sectiontitle"><span id="ALM-45643__text516373020197">Alarm Description</span></h4><p id="ALM-45643__p66405588">The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (<strong id="ALM-45643__b862311817011">metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration</strong>, 180s by default). This alarm is generated when the MemTable size of RocksDB for a job continuously exceeds the threshold (<strong id="ALM-45643__b255095214018">metrics.reporter.alarm.job.alarm.rocksdb.get.micros.threshold</strong>, 50000 microseconds by default). This alarm is cleared when the MemTable size of RocksDB for the job is less than or equal to the threshold.</p>
</div>
<div class="section" id="ALM-45643__section5968939"><h4 class="sectiontitle"><span id="ALM-45643__text20591447192117">Alarm Attributes</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45643__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45643__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45643__p17386810"><span id="ALM-45643__text1864783145211">Alarm ID</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45643__p66154394"><span id="ALM-45643__text297913110521">Alarm Severity</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45643__p49230886"><span id="ALM-45643__text0890175712305">Auto Cleared</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45643__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45643__p5180964">45643</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45643__p17004965">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45643__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45643__section53720453"><h4 class="sectiontitle"><span id="ALM-45643__text18171442142214">Alarm Parameters</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45643__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45643__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45643__p42699947"><span id="ALM-45643__text6203173410617">Parameter</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45643__p36143663"><span id="ALM-45643__text10819164319610">Description</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45643__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45643__p9447153994219">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45643__p144723994214">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45643__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45643__p164471639194216">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45643__p44471639174211">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45643__row73049270124"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45643__p24471539104219">ApplicationName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45643__p1944763954213">Specifies the name of the application for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45643__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45643__p1244715394427">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45643__p44471439144216">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45643__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45643__p1244716397426">JobName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45643__p244713917425">Specifies the job for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45643__section13722030"><h4 class="sectiontitle"><span id="ALM-45643__text98201443182317">Impact on the System</span></h4><p id="ALM-45643__p25040094">This alarm has no adverse impact on the system.</p>
</div>
<div class="section" id="ALM-45643__section56389407"><h4 class="sectiontitle"><span id="ALM-45643__text11871546172411">Possible Causes</span></h4><p id="ALM-45643__p586761744614">The write pressure of RocksDB is high.</p>
</div>
<div class="section" id="ALM-45643__section178029519112"><h4 class="sectiontitle"><span id="ALM-45643__text79051154102518">Handling Procedure</span></h4><p id="ALM-45643__p365051817011"><strong id="ALM-45643__b923914178616">Check TaskManager logs for the write pressure of RocksDB and collect logs.</strong></p>
<ol id="ALM-45643__ol148115212482"><li id="ALM-45643__li1648145217484"><span>Log in to FusionInsight Manager as a user who has the FlinkServer management permission.</span></li><li id="ALM-45643__li143241423112412"><span>Choose <strong id="ALM-45643__b4552762293">O&M</strong> > <strong id="ALM-45643__b12552764297">Alarm</strong> > <strong id="ALM-45643__b16553116202910">Alarms</strong> > <strong id="ALM-45643__b17553166192910">ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold</strong>, view <strong id="ALM-45643__b1355412611297">Location</strong>, and obtain the name of the task for which the alarm is generated.</span></li><li id="ALM-45643__li1548155217481"><span>Choose <strong id="ALM-45643__b1184012196362">Cluster</strong> > <strong id="ALM-45643__b118401719173617">Services</strong> > <strong id="ALM-45643__b15840319173617">Yarn</strong> and click the link next to <strong id="ALM-45643__b48401019153614">ResourceManager WebUI</strong> to go to the native Yarn page.</span></li></ol><ol start="4" id="ALM-45643__ol204835294814"><li id="ALM-45643__li174855264820"><span>Locate the abnormal task based on its name displayed in <strong id="ALM-45643__b149721252163419">Location</strong>, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.</span><p><div class="fignone" id="ALM-45643__en-us_topic_0000001445372489_fig1390461517192"><span class="figcap"><b>Figure 1 </b>Application ID of a job</span><br><span><img id="ALM-45643__image11914112105516" src="en-us_image_0000001971808442.png"></span></div>
<ul id="ALM-45643__ul15821111193614"><li id="ALM-45643__li11829112367">If yes, go to <a href="#ALM-45643__li14941184217233">5</a>.</li><li id="ALM-45643__li148216113364">If no, go to <a href="#ALM-45643__li382131054218">6</a>.</li></ul>
</p></li><li id="ALM-45643__li14941184217233"><a name="ALM-45643__li14941184217233"></a><a name="li14941184217233"></a><span>Click the application ID of the abnormal job to go to the job page.</span><p><ol type="a" id="ALM-45643__en-us_topic_0000001445372489_ol18905161513191"><li id="ALM-45643__en-us_topic_0000001445372489_li090431510192">Click <strong id="ALM-45643__b18526171933515">Logs</strong> in the <strong id="ALM-45643__b1352761903510">Logs</strong> column to view JobManager logs.<div class="fignone" id="ALM-45643__en-us_topic_0000001445372489_fig0904115131915"><span class="figcap"><b>Figure 2 </b>Clicking Logs</span><br><span><img id="ALM-45643__en-us_topic_0000001445372489_image290471501913" src="en-us_image_0000002008248449.png"></span></div>
</li><li id="ALM-45643__en-us_topic_0000001445372489_li232434015269">Click the ID in the <strong id="ALM-45643__b14216234193514">Attempt ID</strong> column and click <strong id="ALM-45643__b6217934113512">Logs</strong> in the <strong id="ALM-45643__b13217193423518">Logs</strong> column to view and save TaskManager logs. Then go to <a href="#ALM-45643__li9907135324219">7</a>.<div class="fignone" id="ALM-45643__en-us_topic_0000001445372489_fig16904101571920"><span class="figcap"><b>Figure 3 </b>Clicking the ID in the Attempt ID column</span><br><span><img id="ALM-45643__en-us_topic_0000001445372489_image1890411511199" src="en-us_image_0000001971648702.png"></span></div>
<div class="fignone" id="ALM-45643__en-us_topic_0000001445372489_fig67971748144610"><span class="figcap"><b>Figure 4 </b>Clicking Logs</span><br><span><img id="ALM-45643__en-us_topic_0000001445372489_image1620681118112" src="en-us_image_0000002008129021.png"></span></div>
<div class="note" id="ALM-45643__en-us_topic_0000001445372489_note126111528152718"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-45643__en-us_topic_0000001445372489_p14611162814277">You can also log in to Manager as a user who has the management permission for the current Flink job. Choose <strong id="ALM-45643__b3389027184918">Cluster</strong> > <strong id="ALM-45643__b15389172719494">Services</strong> > <strong id="ALM-45643__b8389162754914">Flink</strong>, and click the link next to <strong id="ALM-45643__b163901827174915">Flink WebUI</strong>. On the displayed Flink web UI, click <strong id="ALM-45643__b1539010273495">Job Management</strong>, click <strong id="ALM-45643__b12390172774919">More</strong> in the <strong id="ALM-45643__b5390427124911">Operation</strong> column, and select <strong id="ALM-45643__b153901127144910">Job Monitoring</strong> to view TaskManager logs.</p>
</div></div>
</li></ol>
</p></li></ol>
<p id="ALM-45643__p970185154218"><strong id="ALM-45643__b2140752563">If logs are unavailable on the Yarn page, download logs from HDFS.</strong></p>
<ol start="6" id="ALM-45643__ol158315102428"><li id="ALM-45643__li382131054218"><a name="ALM-45643__li382131054218"></a><a name="li382131054218"></a><span>On Manager, choose <strong id="ALM-45643__b111754171566">Cluster</strong> > <strong id="ALM-45643__b1317519174567">Services</strong> > <strong id="ALM-45643__b1117511713565">HDFS</strong>, click the link next to <strong id="ALM-45643__b18176317125617">NameNode WebUI</strong> to go to the HDFS page, choose <strong id="ALM-45643__b51766179565">Utilities</strong> > <strong id="ALM-45643__b6176617155618">Browse the file system</strong>, and download the logs from the <strong id="ALM-45643__b41761117105615">/tmp/logs/</strong><em id="ALM-45643__i21766176567">Username</em><strong id="ALM-45643__b13177317205614">/bucket-logs-tfile/</strong><em id="ALM-45643__i1917751765616">Last four digits of the task application ID/Application ID of the task</em> directory.</span></li></ol>
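<p>The directory path in step 6 can be assembled mechanically from the user name and the application ID. The following is a minimal Python sketch for illustration only, not part of MRS: the helper name <strong>hdfs_log_dir</strong>, the user name, and the application ID are hypothetical.</p>

```python
def hdfs_log_dir(username: str, application_id: str) -> str:
    """Build the HDFS log directory described in step 6.

    Layout taken from the procedure:
    /tmp/logs/<Username>/bucket-logs-tfile/<last four digits of the ID>/<application ID>
    """
    last_four = application_id[-4:]
    if len(last_four) != 4 or not last_four.isdigit():
        # The procedure expects the ID to end in at least four digits.
        raise ValueError(f"unexpected application ID: {application_id!r}")
    return f"/tmp/logs/{username}/bucket-logs-tfile/{last_four}/{application_id}"


if __name__ == "__main__":
    # Hypothetical user name and application ID, for illustration only.
    print(hdfs_log_dir("flinkuser", "application_1700000000000_0012"))
```

<p>Run the sketch with the application ID recorded in step 4 to obtain the directory to open under <strong>Browse the file system</strong>.</p>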
<p id="ALM-45643__p1059864934216"><strong id="ALM-45643__b11137174714426">Check whether the write pressure of RocksDB is high.</strong></p>
<ol start="7" id="ALM-45643__ol14908105315427"><li id="ALM-45643__li9907135324219"><a name="ALM-45643__li9907135324219"></a><a name="li9907135324219"></a><span>Check whether the value of <strong id="ALM-45643__b1477102314118">rocksdb.size-all-mem-tables</strong> (unit: byte) in the TaskManager monitoring logs (keyword <strong id="ALM-45643__b14546194431120">RocksDBMetricPrint</strong>) is greater than or equal to the total write buffer size (Total write buffer = <strong id="ALM-45643__b1547172118124">write_buffer_size</strong> x <strong id="ALM-45643__b1584172715125">max_write_buffer_number</strong>).</span><p><ul id="ALM-45643__ul159078534421"><li id="ALM-45643__li590765364219">If yes, adjust the values of the following custom parameters on the job development page of the Flink web UI, save the settings, and go to <a href="#ALM-45643__li490825314219">8</a>.
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45643__table742562018424" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Custom parameters</caption><thead align="left"><tr id="ALM-45643__row1742622017427"><th align="left" class="cellrowborder" valign="top" width="28.110000000000003%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.1"><p id="ALM-45643__p7426020114213">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="32.98%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.2"><p id="ALM-45643__p242672034215">Default Value</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="38.91%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.3"><p id="ALM-45643__p174261820114215">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45643__row124278206422"><td class="cellrowborder" valign="top" width="28.110000000000003%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.1 "><p id="ALM-45643__p1442762074212">state.backend.rocksdb.writebuffer.count</p>
</td>
<td class="cellrowborder" valign="top" width="32.98%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.2 "><ul id="ALM-45643__ul16935115111018"><li id="ALM-45643__li159351851608"><strong id="ALM-45643__b942414816217">2</strong></li><li id="ALM-45643__li2093517511709"><strong id="ALM-45643__b18682812192119">4</strong> when <strong id="ALM-45643__b0693143712217">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled</li></ul>
</td>
<td class="cellrowborder" valign="top" width="38.91%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45643__ul516413118114"><li id="ALM-45643__li131641531314">Number of write buffers</li><li id="ALM-45643__li18164133114112">A value from <strong id="ALM-45643__b82674220225">2</strong> to <strong id="ALM-45643__b12660194313228">10</strong> is recommended. Adjust the value based on service requirements.</li></ul>
</td>
</tr>
<tr id="ALM-45643__row2042762011422"><td class="cellrowborder" valign="top" width="28.110000000000003%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.1 "><p id="ALM-45643__p642718203428">state.backend.rocksdb.writebuffer.size</p>
</td>
<td class="cellrowborder" valign="top" width="32.98%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.2 "><p id="ALM-45643__p842752014420"><strong id="ALM-45643__b16172154172215">64MB</strong></p>
</td>
<td class="cellrowborder" valign="top" width="38.91%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45643__ul1913015508511"><li id="ALM-45643__li15130165015517">Size of each write buffer</li><li id="ALM-45643__li1390194825710">A value from <strong id="ALM-45643__b33211412313">64MB</strong> to <strong id="ALM-45643__b139868912231">256MB</strong> is recommended.</li></ul>
</td>
</tr>
<tr id="ALM-45643__row12427122011422"><td class="cellrowborder" valign="top" width="28.110000000000003%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.1 "><p id="ALM-45643__p114271120194211">state.backend.rocksdb.thread.num</p>
</td>
<td class="cellrowborder" valign="top" width="32.98%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.2 "><ul id="ALM-45643__ul592313441413"><li id="ALM-45643__li69236441045"><strong id="ALM-45643__b31651144473">2</strong></li><li id="ALM-45643__li159231644247"><strong id="ALM-45643__b12977164515716">4</strong> when <strong id="ALM-45643__b1897874518710">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled</li></ul>
</td>
<td class="cellrowborder" valign="top" width="38.91%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45643__ul01721453853"><li id="ALM-45643__li217318532515">Number of flush threads. Increasing this value flushes in-memory data to disk faster.</li><li id="ALM-45643__li1577014227598">If you increase the number of threads, also increase the number of vCores.</li><li id="ALM-45643__li2779182219585">A value from <strong id="ALM-45643__b354420468811">2</strong> to <strong id="ALM-45643__b0544194616814">10</strong> is recommended.</li></ul>
</td>
</tr>
</tbody>
</table>
</div>
</li></ul>
<ul id="ALM-45643__ul49074533427"><li id="ALM-45643__li1190735310428">If no, go to <a href="#ALM-45643__li390818530424">9</a>.</li></ul>
</p></li><li id="ALM-45643__li490825314219"><a name="ALM-45643__li490825314219"></a><a name="li490825314219"></a><span>Restart the job and check whether the alarm is cleared.</span><p><ul id="ALM-45643__ul49081553124216"><li id="ALM-45643__li090818532424">If yes, no further action is required.</li><li id="ALM-45643__li199081953104216">If no, go to <a href="#ALM-45643__li390818530424">9</a>.</li></ul>
</p></li><li id="ALM-45643__li390818530424"><a name="ALM-45643__li390818530424"></a><a name="li390818530424"></a><span>Contact <span id="ALM-45643__text390885334214">O&M personnel</span> and send the collected logs.</span></li></ol>
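<p>The comparison in step 7 and the recommended ranges in Table 1 can be checked with a short script. The following is a minimal Python sketch under stated assumptions, not an MRS tool: the function names are hypothetical, the value of <strong>rocksdb.size-all-mem-tables</strong> is read manually from the TaskManager log and passed in as a number of bytes, and "MB" in Table 1 is assumed to mean 1024 x 1024 bytes.</p>

```python
MIB = 1024 * 1024  # assumption: "MB" in Table 1 is read as 1024 * 1024 bytes

# Recommended inclusive ranges from Table 1; the buffer size is expressed in MB.
RECOMMENDED_RANGES = {
    "state.backend.rocksdb.writebuffer.count": (2, 10),
    "state.backend.rocksdb.writebuffer.size": (64, 256),
    "state.backend.rocksdb.thread.num": (2, 10),
}


def total_write_buffer_bytes(write_buffer_size_bytes, max_write_buffer_number):
    """Total write buffer = write_buffer_size x max_write_buffer_number."""
    return write_buffer_size_bytes * max_write_buffer_number


def memtable_reaches_total(size_all_mem_tables_bytes,
                           write_buffer_size_bytes,
                           max_write_buffer_number):
    """Return True if the observed MemTable size is greater than or equal to
    the total write buffer size, which is the "yes" branch of step 7."""
    total = total_write_buffer_bytes(write_buffer_size_bytes,
                                     max_write_buffer_number)
    return size_all_mem_tables_bytes >= total


def outside_recommended(settings):
    """Return the names of the parameters whose planned value falls outside
    the recommended range in Table 1. Unknown parameter names are ignored."""
    result = []
    for name, value in settings.items():
        if name in RECOMMENDED_RANGES:
            low, high = RECOMMENDED_RANGES[name]
            if not low <= value <= high:
                result.append(name)
    return result


if __name__ == "__main__":
    # Hypothetical observation: 140 MB of MemTables with the default settings
    # from Table 1 (64MB buffers, 2 buffers).
    print(memtable_reaches_total(140 * MIB, 64 * MIB, 2))
    print(outside_recommended({
        "state.backend.rocksdb.writebuffer.count": 4,
        "state.backend.rocksdb.writebuffer.size": 128,
        "state.backend.rocksdb.thread.num": 4,
    }))
```

<p>If the first check returns true, adjust the parameters in Table 1; the second check only confirms that the planned values stay within the recommended ranges and does not replace tuning based on service requirements.</p>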
</div>
<div class="section" id="ALM-45643__section169311343318"><h4 class="sectiontitle"><span id="ALM-45643__text195945622616">Alarm Clearance</span></h4><p id="ALM-45643__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45643__section4139237"><h4 class="sectiontitle"><span id="ALM-45643__text143698488285">Related Information</span></h4><p id="ALM-45643__p33559471"><span id="ALM-45643__text19275105817121">None.</span></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>