
<a name="mrs_01_2084"></a>

<h1 class="topictitle1">Why Does the Switchover of ResourceManager Occur Continuously?</h1>
<div id="body1596167575153"><div class="section" id="mrs_01_2084__s726b5a5b9ad342f5b4660b45c8d7612f"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2084__a73b0f8c6d5194e858e80fa3a6fa53535">When many tasks (for example, 2,000) run concurrently, ResourceManager switches between active and standby continuously, making the Yarn service unavailable.</p>
</div>
<div class="section" id="mrs_01_2084__s2bd8969538e845cdbb4e63427e310d17"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2084__af03a4258bc7f4796bcbe01a34e0ca042">The cause is that a full garbage collection (GC) takes longer than the threshold for the periodic interaction between ResourceManager and ZooKeeper. As a result, the connection between ResourceManager and ZooKeeper fails, and ResourceManager switches over continuously.</p>
<p id="mrs_01_2084__af3dfc3fd86284a4ca770fba86b9cfcf7">When many tasks are running, ResourceManager saves the authentication information of these tasks and sends it to NodeManagers in heartbeats. This information is called the heartbeat response. A heartbeat response is short-lived (1s by default), so it is normally reclaimed during a JVM minor GC. However, if there are many tasks and the cluster has a large number of nodes (for example, 5,000), the heartbeat responses for all these nodes occupy a large amount of memory, and the JVM cannot reclaim them completely during minor GC. The heartbeat responses that are not reclaimed accumulate until a JVM full GC is triggered. JVM GC runs in blocking mode; that is, no jobs are performed during GC. Therefore, if a full GC lasts longer than the threshold for the periodic interaction between ResourceManager and ZooKeeper, a switchover occurs.</p>
<p id="mrs_01_2084__aafee3834b4bb4ab582d4fcc9731cd36d">To solve the problem, log in to FusionInsight Manager, choose <strong id="mrs_01_2084__b1745312818479">Cluster</strong> &gt; <strong id="mrs_01_2084__b1245322811472">Services</strong> &gt; <strong id="mrs_01_2084__b154533285471">Yarn</strong>, click the <strong id="mrs_01_2084__b17454228114718">Configurations</strong> tab, and then click <strong id="mrs_01_2084__b9454162834711">All Configurations</strong>. In the navigation pane on the left, choose <strong id="mrs_01_2084__b13454142864711">Yarn</strong> &gt; <strong id="mrs_01_2084__b1745492813471">Customization</strong>, and add the <span class="parmname" id="mrs_01_2084__parmname1845462894716"><b>yarn.resourcemanager.zk-timeout-ms</b></span> parameter to the <span class="filepath" id="mrs_01_2084__filepath145422844712"><b>yarn.yarn-site.customized.configs</b></span> file to increase the threshold of the periodic interaction duration between ResourceManager and ZooKeeper. The value must be less than or equal to 90,000 ms. Increasing this threshold stops the continuous active/standby ResourceManager switchover.</p>
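<p>As a rough illustration, the customized entry above corresponds to a property of the following form. This is a sketch that assumes the entry is written to <b>yarn-site.xml</b> as a standard Hadoop property. The value <b>90000</b> is only an example at the upper limit stated above; choose a value longer than the full GC pauses observed in your cluster.</p>
<pre>
&lt;!-- Example only: the value is an assumption, not a recommended setting. --&gt;
&lt;property&gt;
  &lt;name&gt;yarn.resourcemanager.zk-timeout-ms&lt;/name&gt;
  &lt;!-- Timeout in milliseconds; must not exceed 90000. --&gt;
  &lt;value&gt;90000&lt;/value&gt;
&lt;/property&gt;
</pre>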
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2077.html">Common Issues About Yarn</a></div>
</div>
</div>