
<a name="mrs_01_1973"></a>

<h1 class="topictitle1">Small File Combination Tools</h1>
<div id="body1595920217120"><div class="section" id="mrs_01_1973__section9305115516146"><h4 class="sectiontitle">Tool Overview</h4><p id="mrs_01_1973__p1884214331610">In a large-scale Hadoop production cluster, the NameNode keeps HDFS metadata in memory, so the cluster scale is limited by the memory of each NameNode. A large number of small files in HDFS consumes a large amount of NameNode memory, which greatly reduces read and write performance and prolongs job running time. The small file problem is therefore a key factor that restricts the expansion of a Hadoop cluster.</p>
<p id="mrs_01_1973__p1528914478334">This tool provides the following functions:</p>
<ol id="mrs_01_1973__ol185508543367"><li id="mrs_01_1973__li3550195483613">Scans a table, counts the files that are smaller than a user-configured threshold, and returns the average size of all data files in the table directory.</li><li id="mrs_01_1973__li6550195403611">Combines the files of a table. Users can set the average file size after combination.</li></ol>
</div>
<div class="section" id="mrs_01_1973__section116413207313"><h4 class="sectiontitle">Supported Table Types</h4><p id="mrs_01_1973__p179668186203">Spark: Parquet, ORC, CSV, Text, and JSON.</p>
<p id="mrs_01_1973__p8432345141912">Hive: Parquet, ORC, CSV, Text, RCFile, SequenceFile, and bucket tables.</p>
<div class="note" id="mrs_01_1973__note858742516213"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ol id="mrs_01_1973__ol5695175281818"><li id="mrs_01_1973__li3695105231810">After tables with compressed data are merged, Spark uses the default compression format Snappy for data compression. You can configure <strong id="mrs_01_1973__b1399310201077">spark.sql.parquet.compression.codec</strong> (available values: <strong id="mrs_01_1973__b1599814201972">uncompressed</strong>, <strong id="mrs_01_1973__b89980202714">gzip</strong>, <strong id="mrs_01_1973__b209995201278">lzo</strong>, and <strong id="mrs_01_1973__b399982016720">snappy</strong>) and <strong id="mrs_01_1973__b1099910201776">spark.sql.orc.compression.codec</strong> (available values: <strong id="mrs_01_1973__b1999020373">uncompressed</strong>, <strong id="mrs_01_1973__b18052110715">zlib</strong>, <strong id="mrs_01_1973__b0010218717">lzo</strong>, and <strong id="mrs_01_1973__b160182114712">snappy</strong>) on the client to select the compression format for Parquet and ORC tables. The compression formats available for Hive tables and Spark tables differ, and only the formats listed above are supported.</li><li id="mrs_01_1973__li16172019134513">To merge bucket table data, add the following configuration to the <strong id="mrs_01_1973__b1329262317718">hive-site.xml</strong> file on the Spark2x client:<pre class="screen" id="mrs_01_1973__screen1580393045112">&lt;property&gt;
&lt;name&gt;hive.enforce.bucketing&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hive.enforce.sorting&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;</pre>
</li><li id="mrs_01_1973__li4708539114618">Spark does not support the feature of encrypting data columns in Hive.</li></ol>
</div></div>
</div>
<div class="section" id="mrs_01_1973__section1170226171812"><h4 class="sectiontitle">Tool Usage</h4><p id="mrs_01_1973__p5521530203411">Download and install the client. For example, the installation directory is <span class="filepath" id="mrs_01_1973__filepath161417291476"><b>/opt/client</b></span>. Go to <strong id="mrs_01_1973__b205041010987">/opt</strong><strong id="mrs_01_1973__b85041110785"></strong><strong id="mrs_01_1973__b65048101813">/client/Spark2x/spark/bin</strong> and run the <strong id="mrs_01_1973__b11709652476">mergetool.sh</strong> script.</p>
<p id="mrs_01_1973__p0797125834419"><strong id="mrs_01_1973__b850211111583">Loading environment variables</strong></p>
<p id="mrs_01_1973__p16678104810458"><strong id="mrs_01_1973__b3901112034612">source /opt/client/bigdata_env</strong></p>
<p id="mrs_01_1973__p18253732464"><strong id="mrs_01_1973__b19910132017468">source /opt/client/Spark2x/component_env</strong></p>
<p id="mrs_01_1973__p18208113461814"><strong id="mrs_01_1973__b176881814814">Scanning function</strong></p>
<p id="mrs_01_1973__p159799438193">Command: <b><span class="cmdname" id="mrs_01_1973__cmdname62852614203">sh mergetool.sh scan &lt;db.table&gt; &lt;filesize&gt;</span></b></p>
<p id="mrs_01_1973__p10952458161814">The format of <em id="mrs_01_1973__i16845163916820">db.table</em> is <em id="mrs_01_1973__i284514391787">Database name</em>.<em id="mrs_01_1973__i384553911812">Table name</em>. <em id="mrs_01_1973__i1784619391981">filesize</em> is the user-defined small-file size threshold (unit: MB). The command returns the number of files that are smaller than the threshold and the average size of the data files in the table directory.</p>
<p id="mrs_01_1973__p58421872210">Example: <strong id="mrs_01_1973__b1372441988">sh mergetool.sh scan default.table1 128</strong></p>
<p id="mrs_01_1973__p10666113952112"><strong id="mrs_01_1973__b1599184417814">Combination function</strong></p>
<p id="mrs_01_1973__p173061550132116">Command: <b><span class="cmdname" id="mrs_01_1973__cmdname11957717444">sh mergetool.sh merge &lt;db.table&gt; &lt;filesize&gt; &lt;shuffle&gt;</span></b></p>
<p id="mrs_01_1973__p36353912228">The format of <em id="mrs_01_1973__i35711058985">db.table</em> is <em id="mrs_01_1973__i1057275815818">Database name.Table name</em>. <strong id="mrs_01_1973__b205728588814">filesize</strong> is the user-defined average file size after file combination (unit: MB). <strong id="mrs_01_1973__b11572175814819">shuffle</strong> is a Boolean value (<strong id="mrs_01_1973__b1572458584">true</strong> or <strong id="mrs_01_1973__b7572195814815">false</strong>) that specifies whether data can be shuffled during the merge.</p>
<p id="mrs_01_1973__p565946182316">Example: <strong id="mrs_01_1973__b8435401395">sh mergetool.sh merge default.table1 128 false</strong></p>
<p id="mrs_01_1973__p2219185923317">If the following information is displayed, the operation is successful:</p>
<pre class="screen" id="mrs_01_1973__screen14101571330">SUCCESS: Merge succeeded</pre>
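<p>The steps above can be strung together as one session. This is a minimal sketch that reuses the example client directory <strong>/opt/client</strong> and the example table <strong>default.table1</strong>. The merge target of 256 MB is a hypothetical value, chosen only to illustrate picking a <strong>filesize</strong> larger than the average that the scan reports.</p>
<pre class="screen"># Load the client environment (example installation directory /opt/client).
source /opt/client/bigdata_env
source /opt/client/Spark2x/component_env
cd /opt/client/Spark2x/spark/bin

# Step 1: count the files smaller than 128 MB and report the average file size.
sh mergetool.sh scan default.table1 128

# Step 2: merge with a target average size above the reported average.
# 256 is an illustrative value, not a recommendation.
sh mergetool.sh merge default.table1 256 false</pre>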
<div class="note" id="mrs_01_1973__n236aa1938a64463b8383ff0d89d0fe2f"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ol id="mrs_01_1973__ol10847161583919"><li id="mrs_01_1973__li1584761563912">Ensure that the current user is the owner of the merged table.</li><li id="mrs_01_1973__li20740179382">Before combination, ensure that the available HDFS storage space is greater than the size of the table to be combined.</li><li id="mrs_01_1973__li1784831511394">Table data must be combined separately. If a table is read while its data is being combined, files may temporarily not be found. The problem disappears after the combination is complete. Do not write data to the table during the combination. Otherwise, data inconsistency may occur.</li><li id="mrs_01_1973__li18460523584">If an error indicating that a file does not exist is reported when data in a partitioned table is queried in a spark-beeline or spark-sql session that has stayed connected, run the <strong id="mrs_01_1973__b4260298914">refresh table</strong> <em id="mrs_01_1973__i62651791794">Table name</em> command as prompted and then query the data again.</li><li id="mrs_01_1973__li952213016475">Configure <strong id="mrs_01_1973__b1522615118910">filesize</strong> based on the site requirements. For example, run a scan to obtain the average file size, and then set <strong id="mrs_01_1973__b12226101116911">filesize</strong> to a value greater than that average for the merge. Otherwise, the number of files may increase after the merge.</li><li id="mrs_01_1973__li17540938171017">During file merging, data in the original tables is moved to the recycle bin. If the data is abnormal after the merge, the data in the original tables can be used to replace the damaged data. If an exception occurs during the process, restore the data in the trash directory by using the <strong id="mrs_01_1973__b849113141791">mv</strong> command in HDFS.</li><li id="mrs_01_1973__li1348712215208">In the HDFS router federation scenario, if the target NameService of the table root path is different from that of the root path <strong id="mrs_01_1973__b10321164621915">/user</strong>, you need to manually clear the original table files stored in the recycle bin during the second combination. Otherwise, the combination fails.</li><li id="mrs_01_1973__li19836190911">This tool uses the client configuration. To optimize performance, modify the relevant parameters in the client configuration file.</li></ol>
</div></div>
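<p>The note above mentions restoring data from the trash directory with the <strong>mv</strong> command. The following is a minimal sketch of such a restore. It assumes that the tool follows the default HDFS trash layout, in which a removed path is kept under <strong>/user/</strong><em>User name</em><strong>/.Trash/Current</strong> followed by its original path. The user name and table directory are hypothetical placeholders, so list the trash directory first and confirm where the original files actually are before moving anything.</p>
<pre class="screen"># Hypothetical values: replace with the HDFS user that ran the merge
# and with the actual directory of the table.
HDFS_USER=sparkuser
TABLE_DIR=/user/hive/warehouse/table1

# Assumed location, based on the default HDFS trash layout.
TRASH_DIR="/user/$HDFS_USER/.Trash/Current$TABLE_DIR"

# Inspect what was moved to the trash before restoring.
hdfs dfs -ls -R "$TRASH_DIR"

# Move the original files back into the table directory.
# The pattern is quoted so that HDFS, not the local shell, expands it.
hdfs dfs -mv "$TRASH_DIR/*" "$TABLE_DIR/"</pre>
<p>If the table directory still contains damaged files with the same names, the move fails, so remove or rename those files first.</p>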
</div>
<p id="mrs_01_1973__p134384054116"><strong id="mrs_01_1973__b191704170917">shuffle configuration</strong></p>
<p id="mrs_01_1973__p94391308414">Before running the combination function, you can roughly estimate how the number of partitions will change as a result of the combination.</p>
<p id="mrs_01_1973__p643915019413">Generally, if the number of old partitions is greater than the number of new partitions, set <strong id="mrs_01_1973__b1817162017917">shuffle</strong> to <strong id="mrs_01_1973__b10176112019911">false</strong>. However, if the number of old partitions is much greater than that of new partitions (for example, more than 100 times), you can set <strong id="mrs_01_1973__b1117715208910">shuffle</strong> to <strong id="mrs_01_1973__b111779201894">true</strong> to increase the degree of parallelism and improve the combination speed.</p>
<div class="notice" id="mrs_01_1973__note9853163211277"><span class="noticetitle"><img src="public_sys-resources/notice_3.0-en-us.png"> </span><div class="noticebody"><ul id="mrs_01_1973__ul7394184363911"><li id="mrs_01_1973__li15395243143917">If <strong id="mrs_01_1973__b8350122119914">shuffle</strong> is set to <strong id="mrs_01_1973__b53511721190">true</strong> (repartition), the performance is improved. However, due to the particularity of the Parquet and ORC storage modes, repartitioning reduces the compression ratio, and the total size of the table in HDFS increases by 1.3 times.</li><li id="mrs_01_1973__li2395943133911">If <strong id="mrs_01_1973__b1354462218915">shuffle</strong> is set to <strong id="mrs_01_1973__b165451122295">false</strong> (coalesce), the merged files may differ somewhat in size, but each file is close to the configured <strong id="mrs_01_1973__b754519221597">filesize</strong>.</li></ul>
</div></div>
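<p>The rule of thumb above can be written as a small helper. This is only a sketch: <strong>choose_shuffle</strong> is a hypothetical function that is not part of <strong>mergetool.sh</strong>, both partition counts are rough estimates supplied by the user, and the default ratio of 100 is taken from the example in the text. The value <strong>skip</strong> is a convention of this sketch for the case that the text does not cover, where the combination would not reduce the number of partitions.</p>
<pre class="screen"># Hypothetical helper, not part of mergetool.sh.
# Usage: choose_shuffle OLD_PARTITIONS NEW_PARTITIONS [RATIO]
choose_shuffle() {
  old="$1"
  new="$2"
  ratio="${3:-100}"   # "more than 100 times" is the example threshold from the text
  if [ "$old" -le "$new" ]; then
    echo "skip"       # sketch convention: the partition count would not shrink
  elif [ "$old" -gt $((new * ratio)) ]; then
    echo "true"       # far fewer new partitions: shuffle for more parallelism
  else
    echo "false"      # moderately fewer new partitions: no shuffle needed
  fi
}

# Example: about 50000 old partitions combined into about 40 new ones.
choose_shuffle 50000 40</pre>
<p>The printed value can then be passed as the <strong>shuffle</strong> argument of the merge command, keeping in mind the size increase that <strong>true</strong> can cause for Parquet and ORC tables.</p>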
<p id="mrs_01_1973__p577711353617"><strong id="mrs_01_1973__b965914231498">Log storage location</strong></p>
<p id="mrs_01_1973__p1197911332406">The default log storage path is <strong id="mrs_01_1973__b911112254912">/tmp/SmallFilesLog.log4j</strong>. To customize the log storage path, you can configure <strong id="mrs_01_1973__b3112102517912">log4j.appender.logfile.File</strong> in <strong id="mrs_01_1973__b181124251693">/opt/client/Spark2x/spark/tool/log4j.properties</strong>.</p>
</div>