Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00

<a name="mrs_01_1988"></a>
<h1 class="topictitle1">Optimizing Spark SQL Performance in the Small File Scenario</h1>
<div id="body1595920218429"><div class="section" id="mrs_01_1988__s2fd56ad0027a4d3a879d6278b01de72b"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1988__a3b453994cae749fa94f748b0b71b32ac">A Spark SQL table may contain many small files (far smaller than an HDFS block). By default, each small file maps to one partition in Spark, and therefore to one task. If there are many small files, Spark must start a large number of tasks. If the Spark SQL job also involves shuffle operations, the number of hash buckets increases accordingly, which degrades performance.</p>
<p id="mrs_01_1988__a1e49a7fc6511478abf5b98df1596162b">In this scenario, you can manually specify the split size of each task to avoid an excessive number of tasks and improve performance.</p>
<div class="note" id="mrs_01_1988__need28b6a63be4e329acbd390579d03f2"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1988__ac19f33d01cfa46e98adf0a6662db3fc2">If the SQL logic does not involve shuffle operations, this optimization does not improve performance.</p>
</div></div>
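<p>The effect of the task count on shuffle overhead can be illustrated with a rough estimate. The sketch below is a simplified model under stated assumptions, not Spark's scheduling logic: it assumes that all files have the same size, ignores the per-file open cost, and approximates the number of shuffle buckets as the number of map tasks multiplied by the number of reduce partitions (200 is the default value of spark.sql.shuffle.partitions). The function names are hypothetical.</p>

```python
import math

MB = 1024 * 1024


def tasks_one_per_file(num_files):
    """Without packing, every small file becomes its own task."""
    return num_files


def tasks_when_packed(num_files, file_size, split_size):
    """Rough task count when several equally sized files share one split.

    Assumption: the per-file open cost is ignored in this estimate.
    """
    files_per_task = max(1, split_size // file_size)
    return math.ceil(num_files / files_per_task)


def shuffle_buckets(map_tasks, reduce_partitions=200):
    """Approximate bucket count: each map task writes to every reduce partition."""
    return map_tasks * reduce_partitions


if __name__ == "__main__":
    num_files = 10_000        # hypothetical table with 10,000 small files
    file_size = 1 * MB        # each file is 1 MB
    split_size = 128 * MB     # target split size per task

    unpacked = tasks_one_per_file(num_files)
    packed = tasks_when_packed(num_files, file_size, split_size)

    print("tasks, one per file:", unpacked)
    print("tasks, packed:      ", packed)
    print("buckets, one per file:", shuffle_buckets(unpacked))
    print("buckets, packed:      ", shuffle_buckets(packed))
```

<p>In this model, the bucket count shrinks in proportion to the task count, which is why reducing the number of tasks matters most for SQL statements that shuffle data.</p>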
</div>
<div class="section" id="mrs_01_1988__s844906550410480ca86f1727c3c91e50"><h4 class="sectiontitle">Configuration</h4><p id="mrs_01_1988__a44f57251241d4e19a8581319184cf8ca">To enable small file optimization, configure the following parameters in the <strong id="mrs_01_1988__b656773384016">spark-defaults.conf</strong> file on the Spark client.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_1988__t858790d4a9324f15885c2d2c0b223a54" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter description</caption><thead align="left"><tr id="mrs_01_1988__r3ba3c9e1c4f442ceb55b35a976ae66b2"><th align="left" class="cellrowborder" valign="top" width="21.26%" id="mcps1.3.2.3.2.4.1.1"><p id="mrs_01_1988__acb65cfd331c1452cab1d987d8e8f1973">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="66.47999999999999%" id="mcps1.3.2.3.2.4.1.2"><p id="mrs_01_1988__af584a20685c5480398b23e7e008b9a1d">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="12.26%" id="mcps1.3.2.3.2.4.1.3"><p id="mrs_01_1988__a54e4b9a257d9455d82a032583f24ba86">Default Value</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_1988__r862d1f6bcd024e2ba47d69ace17ffe90"><td class="cellrowborder" valign="top" width="21.26%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1988__a0f0971d3e5af40bcb96da170b84f18f7">spark.sql.files.maxPartitionBytes</p>
</td>
<td class="cellrowborder" valign="top" width="66.47999999999999%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1988__aec8d0a215205417989c94fe8e5daf480">The maximum number of bytes that can be packed into a single partition when files are read.</p>
<p id="mrs_01_1988__a19b8b26c2a6644b1aea33e8b24bf4d1c">Unit: byte</p>
</td>
<td class="cellrowborder" valign="top" width="12.26%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1988__ac449f685b84e492ea5a7389a84530c19">134217728 (128 MB)</p>
</td>
</tr>
<tr id="mrs_01_1988__rb17dddae024e42a7bd8354852507dc7f"><td class="cellrowborder" valign="top" width="21.26%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1988__ad91997b89ab24e13ba2978c4bcc79499"><span id="mrs_01_1988__p2e963c6e607c4d948d7172ee1841db6c">spark.sql.files.openCostInBytes</span></p>
</td>
<td class="cellrowborder" valign="top" width="66.47999999999999%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1988__a047baa98ee5843268c2c514b2fa2fa33"><span id="mrs_01_1988__ph14341145065320">The estimated cost of opening a file, measured as the number of bytes that could be scanned in the same amount of time. This value is used when multiple files are packed into one partition. It is better to overestimate this cost: partitions that contain small files will then finish faster than partitions that contain larger files.</span></p>
</td>
<td class="cellrowborder" valign="top" width="12.26%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1988__a086e9d5010e442dca2f8c0ea4fb654d8"><span id="mrs_01_1988__pa47960dafff548a59a90cc6edff76484">4194304 (4 MB)</span></p>
</td>
</tr>
</tbody>
</table>
</div>
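<p>How the two parameters in Table 1 interact can be sketched with a small simulation. The code below is a simplified model that loosely follows the way Spark's file source packs files into partitions, not the actual implementation: it assumes that no file is larger than the split size (so files never need to be split), and <code>min_partition_num</code> is an assumed stand-in for the default parallelism of the cluster. The function names are hypothetical.</p>

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB, min_partition_num=8):
    """Pick the split size: bounded above by the maximum partition size and
    below by the open cost, otherwise an even share of the padded input."""
    # Every file is charged its own length plus the estimated open cost.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def pack_files(file_sizes, split_bytes, open_cost_in_bytes=4 * MB):
    """Greedily pack files, largest first, into partitions of about split_bytes."""
    partitions = []
    current, current_size = [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would overflow it.
        if current and current_size + size > split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        # The open cost counts against the partition budget as well.
        current_size += size + open_cost_in_bytes
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    files = [1 * MB] * 10_000  # hypothetical table: 10,000 files of 1 MB each

    for limit in (128 * MB, 256 * MB):
        split = max_split_bytes(files, max_partition_bytes=limit)
        parts = pack_files(files, split)
        print(f"max partition size {limit // MB} MB: "
              f"{len(parts)} partitions, {len(parts[0])} files in the first")
```

<p>In this model, the default settings pack 26 files of 1 MB into each partition (385 partitions in total), because every file is charged its 1 MB length plus the 4 MB open cost. Raising the maximum partition size to 256 MB gives 193 partitions. Actual partition counts on a cluster depend on the parallelism and the file layout, so treat these figures as an illustration only.</p>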
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1985.html">Spark SQL and DataFrame Tuning</a></div>
</div>
</div>