
<a name="mrs_01_1995"></a>

<h1 class="topictitle1">Optimizing Small Files</h1>
<div id="body1595920218805"><div class="section" id="mrs_01_1995__s883b8dce237248768cdae0555b5ca297"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1995__a66e4075625884a748136a2fd76982aa6">A Spark SQL table may contain many small files (far smaller than an HDFS block). By default, each small file maps to one partition in Spark, and each partition is processed by one task, so Spark has to start a large number of tasks. If the SQL logic involves a shuffle operation, the number of hash buckets soars, severely degrading system performance.</p>
<p id="mrs_01_1995__a902115f15b1d42c6967c7a0125902213">When a table contains a massive number of small files, DataSource splits the small files in the Spark SQL table into PartitionedFiles when it creates an RDD, and then merges multiple PartitionedFiles into one partition. This avoids generating too many hash buckets during the shuffle operation. See <a href="#mrs_01_1995__fc95571bfb3be4f21b9f7dbcdcf493ebb">Figure 1</a>.</p>
<div class="fignone" id="mrs_01_1995__fc95571bfb3be4f21b9f7dbcdcf493ebb"><a name="mrs_01_1995__fc95571bfb3be4f21b9f7dbcdcf493ebb"></a><a name="fc95571bfb3be4f21b9f7dbcdcf493ebb"></a><span class="figcap"><b>Figure 1 </b>Merging small files</span><br><span><img id="mrs_01_1995__i0aa5cb65cab540118a70f11982c04e7d" src="en-us_image_0000001349170393.jpg"></span></div>
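<p>The merging step can be illustrated with a minimal Python sketch. It is a simplified greedy packing loosely modeled on how upstream Apache Spark groups files into partitions, not the exact MRS implementation. The function name <strong>pack_files</strong> and its arguments are hypothetical. The sketch assumes that each file is charged its own size plus an estimated open cost, and it does not split files larger than the partition limit or take cluster parallelism into account.</p>

```python
from typing import List

MB = 1024 * 1024


def pack_files(file_sizes: List[int],
               max_partition_bytes: int = 128 * MB,
               open_cost_in_bytes: int = 4 * MB) -> List[List[int]]:
    """Greedily pack file sizes (in bytes) into partitions (hypothetical helper)."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    # Handle the largest files first so small files fill the remaining room.
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if adding this file would overflow it.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        # Charge each file its own size plus the estimated open cost.
        current.append(size)
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions


small_files = [1 * MB] * 100
merged = pack_files(small_files)
print(len(small_files), "files ->", len(merged), "partitions")
print([len(p) for p in merged])
```

<p>With these assumed inputs (100 files of 1 MB each, a 128 MB partition limit, and a 4 MB open cost), the sketch produces 4 partitions holding 26, 26, 26, and 22 files, instead of 100 single-file partitions.</p>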
</div>
<div class="section" id="mrs_01_1995__sce4aa1c5e7bb40b4b4d84b9e5b8f66e1"><h4 class="sectiontitle">Procedure</h4><p id="mrs_01_1995__a50e74d62af4b4b87aceb7b4c68e285f9">To enable small file optimization, configure the following parameters in the <strong id="mrs_01_1995__b57221718174219">spark-defaults.conf</strong> file on the Spark client.</p>

<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_1995__t8b6506cddde94980b93aefcbf127964f" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter description</caption><thead align="left"><tr id="mrs_01_1995__rf73156e0f9c24416a394bd0f00bfcffa"><th align="left" class="cellrowborder" valign="top" width="36.4%" id="mcps1.3.2.3.2.4.1.1"><p id="mrs_01_1995__a63802da194a4424096e72d3f2c5bb49e">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50.88%" id="mcps1.3.2.3.2.4.1.2"><p id="mrs_01_1995__a0edadaa277b74b4badb910e9ae827959">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="12.72%" id="mcps1.3.2.3.2.4.1.3"><p id="mrs_01_1995__a4a40b4885c584f31944b961593985d74">Default Value</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_1995__r94404f5076504d62b57749aec6967247"><td class="cellrowborder" valign="top" width="36.4%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1995__a508f66f77a78479c9375caa4ba4a3189">spark.sql.files.maxPartitionBytes</p>
</td>
<td class="cellrowborder" valign="top" width="50.88%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1995__a98c5455bf16d44d6be51421a44abc664">The maximum number of bytes that can be packed into a single partition when files are read.</p>
<p id="mrs_01_1995__a0786b4434b50462c839e6d01aafd96b8">Unit: byte</p>
</td>
<td class="cellrowborder" valign="top" width="12.72%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1995__a0322e17d0dba4c4f899f6a439167029c">134217728 (128 MB)</p>
</td>
</tr>
<tr id="mrs_01_1995__rfeee362c797d4609a48a880e1539a507"><td class="cellrowborder" valign="top" width="36.4%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1995__a876d5ea9c3c547649071012e763c16fa"><span id="mrs_01_1995__pd23cbe816ebf4fd993272ef84c5f859e">spark.sql.files.openCostInBytes</span></p>
</td>
<td class="cellrowborder" valign="top" width="50.88%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1995__af1abe43ea676424d8c57a21417e029b9"><span id="mrs_01_1995__ph163221249185419">The estimated cost of opening a file, measured as the number of bytes that could be scanned in the same amount of time. This value is used when multiple files are packed into one partition. It is better to overestimate this value, so that partitions with small files finish faster than partitions with larger files.</span></p>
</td>
<td class="cellrowborder" valign="top" width="12.72%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1995__a436ec817477d4a2bb30c16df7f30fc58"><span id="mrs_01_1995__ph19724171775512">4194304 (4 MB)</span></p>
</td>
</tr>
</tbody>
</table>
</div>
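<p>As an illustration only, the following Python sketch converts the default sizes in the table to bytes and prints them as whitespace-separated key-value lines, the format used in <strong>spark-defaults.conf</strong>. The second key is written as <strong>spark.sql.files.openCostInBytes</strong>, the name used by upstream Apache Spark SQL; treat that spelling as an assumption and check it against your MRS version. The values are the documented defaults, not tuning recommendations.</p>

```python
MB = 1024 * 1024

# Default sizes, expressed in bytes.
settings = {
    "spark.sql.files.maxPartitionBytes": 128 * MB,
    "spark.sql.files.openCostInBytes": 4 * MB,  # key name is an assumption
}

# spark-defaults.conf takes one "key value" pair per line.
conf_lines = [f"{key} {value}" for key, value in settings.items()]
print("\n".join(conf_lines))
```

<p>The script prints <strong>spark.sql.files.maxPartitionBytes 134217728</strong> and <strong>spark.sql.files.openCostInBytes 4194304</strong>. Lines of this form can be added to <strong>spark-defaults.conf</strong> on the Spark client.</p>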
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1985.html">Spark SQL and DataFrame Tuning</a></div>
</div>
</div>