<a name="mrs_01_2054"></a><a name="mrs_01_2054"></a>
<h1 class="topictitle1">Why Is the Input Size Corresponding to Batch Time on the Web UI Set to 0 Records When Kafka Is Restarted During Spark Streaming Running?</h1>
<div id="body1595920224704"><div class="section" id="mrs_01_2054__sc68730c04c314e16809af832f9dc293e"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2054__a61fc317a43994dabbdf447ba921fce03">When the Kafka is restarted during the execution of the Spark Streaming application, the application cannot obtain the topic offset from the Kafka. As a result, the job fails to be generated. As shown in <a href="#mrs_01_2054__f4c352ef623b7496485289010f2d69e3c">Figure 1</a>, <strong id="mrs_01_2054__b72611444243">2017/05/11 10:57:00-2017/05/11 10:58:00</strong> indicates the Kafka restart time. After the restart is successful at 10:58:00 on May,11,2017, the value of <span class="parmname" id="mrs_01_2054__p9854ca04be494cf4b5ed97f16ad1d7b8"><b>Input Size</b></span> is <span class="parmvalue" id="mrs_01_2054__pa23acd567387401499e3f8acc8aa2f6a"><b>0 records</b></span>.</p>
<div class="fignone" id="mrs_01_2054__f4c352ef623b7496485289010f2d69e3c"><a name="mrs_01_2054__f4c352ef623b7496485289010f2d69e3c"></a><a name="f4c352ef623b7496485289010f2d69e3c"></a><span class="figcap"><b>Figure 1 </b>On the Web UI, the <strong id="mrs_01_2054__b4528183217243">input size</strong> corresponding to the <strong id="mrs_01_2054__b115295321242">batch time</strong> is <strong id="mrs_01_2054__b85291832122415">0 records</strong>.</span><br><span><img id="mrs_01_2054__iab704267619c46e9b744388d84e28e6a" src="en-us_image_0000001349090029.png"></span></div>
</div>
<div class="section" id="mrs_01_2054__s1ee317fb4f724a39ac5f90603b3443ea"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2054__af3fc46e43bf04928ac071a40f396cefd">After Kafka is restarted, the application supplements the missing RDD between 10:57:00 on May 11, 2017 and 10:58:00 on May 11, 2017 based on the batch time. Although the number of read data records displayed on the UI is <span class="parmvalue" id="mrs_01_2054__p825425acb5e4436f929b300b8071f4b1"><b>0</b></span>, the missing data is processed in the supplemented RDD. Therefore, no data loss occurs.</p>
<p id="mrs_01_2054__a828ff2f9f5c14332ae40943104155fa5">The data processing mechanism during the Kafka restart period is as follows:</p>
<p id="mrs_01_2054__aaa97fef600fd46dc96343d956b54ef0d">The Spark Streaming application uses the <strong id="mrs_01_2054__b581319443282">state</strong> function (for example, <strong id="mrs_01_2054__b19818194419289">updateStateByKey</strong>). After Kafka is restarted, the Spark Streaming application generates a batch task at 10:58:00 on May 11, 2017. The missing RDD between10:57:00 on May 11, 2017 and 10:58:00 on May 11, 2017 is supplemented based on the batch time (data that is not read in Kafka before Kafka restart, which belongs to the batch before 10:57:00 on May 11, 2017).</p>
<p id="mrs_01_2054__aa59d2c48585447e6a045edaecb034966"></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2048.html">Spark Streaming</a></div>
</div>
</div>