
<a name="mrs_01_1987"></a><a name="mrs_01_1987"></a>

<h1 class="topictitle1">Improving Spark SQL Calculation Performance Under Data Skew</h1>
<div id="body1595920218427"><div class="section" id="mrs_01_1987__s48ecc6f68c104b189d3d24e0055f703e"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1987__ae170713b63ca454db4264308d55554ca">When multiple tables are joined in Spark SQL, the join keys may be skewed, so that some hash buckets hold far more data than others. The tasks that process the large buckets run slowly and lower the overall computing performance, while the tasks that process the small buckets finish quickly and leave their CPUs idle, which wastes CPU resources.</p>
<p id="mrs_01_1987__ad31883d5f76c43fc98b0e59ec4c78831">If the automatic data skew function is enabled, data that exceeds the bucketing threshold is bucketed, and multiple tasks process the data in one bucket. This improves CPU usage and system performance.</p>
<div class="note" id="mrs_01_1987__nfc19a9710d4948738ce2134cd8eae99f"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1987__a7121b75c23014576a653878ece6731e4">Data that is not skewed is bucketed and processed in the original way.</p>
</div></div>
<p id="mrs_01_1987__a07ecd141d61e41e0a93b9ba5a5f040d4">Restrictions:</p>
<ul id="mrs_01_1987__u0ec9ed84dcc34e3694cd9f75a2a9c577"><li id="mrs_01_1987__lb714e01067e44e429f3a5efa1b32245d">Only the join between two tables is supported.</li><li id="mrs_01_1987__l119f10f660894882a900aa22c2cc1802">FULL OUTER JOIN does not support data skew optimization.<p id="mrs_01_1987__a4215180e46d04267903157b7b7354f33"><a name="mrs_01_1987__l119f10f660894882a900aa22c2cc1802"></a><a name="l119f10f660894882a900aa22c2cc1802"></a>For example, in the following SQL statement, skew in table <strong id="mrs_01_1987__b7578857173918">a</strong> or table <strong id="mrs_01_1987__b205831157163918">b</strong> cannot trigger the optimization.</p>
<p id="mrs_01_1987__a36d927212c894162ad8774fe14a1fd40"><i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1987__cea5ce5d536f34ee4b3aa3636ddc10fbe">select aid FROM a FULL OUTER JOIN b ON aid=bid;</span></b></i></p>
</li><li id="mrs_01_1987__l40f4ab9a72e048f0bd69768dfe5e7cbe">LEFT OUTER JOIN does not support data skew optimization for the right table.<p id="mrs_01_1987__a878915f2f6a440ae8f4d7a1a85f88e17"><a name="mrs_01_1987__l40f4ab9a72e048f0bd69768dfe5e7cbe"></a><a name="l40f4ab9a72e048f0bd69768dfe5e7cbe"></a>For example, in the following SQL statement, skew in table <strong id="mrs_01_1987__b157391910114013">b</strong> cannot trigger the optimization.</p>
<p id="mrs_01_1987__ae9e709962a184c6ebd3bce4e5388b09f"><i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1987__c572e9ca847d64a65988c32087e5b0046">select aid FROM a LEFT OUTER JOIN b ON aid=bid;</span></b></i></p>
</li><li id="mrs_01_1987__lcaa092e2904e4f74969d0140db3e2693">RIGHT OUTER JOIN does not support data skew optimization for the left table.<p id="mrs_01_1987__aeec1cc34a5864302ae7a1e2f25140d44"><a name="mrs_01_1987__lcaa092e2904e4f74969d0140db3e2693"></a><a name="lcaa092e2904e4f74969d0140db3e2693"></a>For example, in the following SQL statement, skew in table <strong id="mrs_01_1987__b5559112054017">a</strong> cannot trigger the optimization.</p>
<p id="mrs_01_1987__a7b6b29a58fc64c0aab4191920b9d92bc"><i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1987__cdb5a1861783a4b3f995725f459bf0728">select aid FROM a RIGHT OUTER JOIN b ON aid=bid;</span></b></i></p>
</li></ul>
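<p>The outer-join restrictions above can be summarized as a small lookup. The following Python sketch is only an illustration: the names <strong>UNSUPPORTED_SKEW_SIDES</strong> and <strong>skew_can_trigger</strong> are hypothetical and are not part of Spark or MRS, and join types that the list does not mention are rejected rather than guessed.</p>

```python
# Sides whose skew cannot trigger the optimization, per join type.
# The entries mirror the restriction list; nothing else is assumed.
UNSUPPORTED_SKEW_SIDES = {
    "FULL OUTER": {"left", "right"},
    "LEFT OUTER": {"right"},
    "RIGHT OUTER": {"left"},
}


def skew_can_trigger(join_type, skewed_side):
    """Return True if skew on the given side may trigger the optimization."""
    # Normalize e.g. "left outer join" to "LEFT OUTER".
    key = " ".join(join_type.upper().replace("JOIN", "").split())
    if key not in UNSUPPORTED_SKEW_SIDES:
        raise ValueError("join type not covered by the restriction list: " + join_type)
    if skewed_side not in ("left", "right"):
        raise ValueError("skewed_side must be 'left' or 'right'")
    return skewed_side not in UNSUPPORTED_SKEW_SIDES[key]


# In "a LEFT OUTER JOIN b", a is the left table and b is the right table.
print(skew_can_trigger("LEFT OUTER JOIN", "left"))
print(skew_can_trigger("LEFT OUTER JOIN", "right"))
```

<p>In this sketch, the left table is the one written before the JOIN keyword, so for <strong>a LEFT OUTER JOIN b</strong> only skew in table <strong>a</strong> passes the check.</p>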
</div>
<div class="section" id="mrs_01_1987__sbe278eb74aae44dd8591a8c2ae7ca571"><h4 class="sectiontitle">Configuration Description</h4><p id="mrs_01_1987__a6c7ae09411874f84b1f32365fa4dccca">Add the parameters in the following table to the <span class="filepath" id="mrs_01_1987__fdaf870508546484b8105bb2af7d35aa9"><b>spark-defaults.conf</b></span> configuration file on the Spark driver.</p>
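<p>As an illustration only, a <b>spark-defaults.conf</b> fragment that enables automatic skew handling might look like the following. The two skew-detection values are the defaults listed in Table 1 and are shown just to make them explicit. Disabling DPP follows the note in the table. Adjust all values to your workload.</p>

```properties
# Enable adaptive execution (default: false)
spark.sql.adaptive.enabled                                     true
# Disable DPP so that adaptive execution can take effect (default: true)
spark.sql.optimizer.dynamicPartitionPruning.enabled            false
# Enable automatic handling of data skew in joins (default: true)
spark.sql.adaptive.skewJoin.enabled                            true
# Skew detection settings, shown here with their default values
spark.sql.adaptive.skewJoin.skewedPartitionFactor              5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes    256MB
```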

<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_1987__t49ec7613cfe54495a049b0525719367f" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter description</caption><thead align="left"><tr id="mrs_01_1987__rde1ce0798b2d4f09b00036b48806cab7"><th align="left" class="cellrowborder" valign="top" width="24.8%" id="mcps1.3.2.3.2.4.1.1"><p id="mrs_01_1987__ab645abe8dc314a2dadf03f8aa279386e"><strong id="mrs_01_1987__b16351153734014">Parameter</strong></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="64.86%" id="mcps1.3.2.3.2.4.1.2"><p id="mrs_01_1987__a5960fe6211634b6aab87e57010a23942"><strong id="mrs_01_1987__b969017385400">Description</strong></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="10.34%" id="mcps1.3.2.3.2.4.1.3"><p id="mrs_01_1987__a712b4e7ddc5449b1a0929bf5e5d1b571"><strong id="mrs_01_1987__b159232397409">Default Value</strong></p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_1987__r000ee61156af4dbf98994038de1477e6"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__ac6d3382c42554c56b9232504e5a0583f">spark.sql.adaptive.enabled</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__a68451bd02b164696acb17735e3c641db">The switch to enable the adaptive execution feature.</p>
<p id="mrs_01_1987__p17918584318">Note: If adaptive query execution (AQE) and Dynamic Partition Pruning (DPP) are both enabled, DPP takes precedence over AQE during Spark SQL task execution, and AQE does not take effect. DPP is enabled in the cluster by default, so you need to disable it when you enable AQE.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__a71444c4db18842ccb100761ecf4f2977">false</p>
</td>
</tr>
<tr id="mrs_01_1987__row483253443915"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__p108323343396">spark.sql.optimizer.dynamicPartitionPruning.enabled</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__p883223493914">The switch to enable DPP.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__p1083223412392">true</p>
</td>
</tr>
<tr id="mrs_01_1987__row14974153922417"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__p6974123913242">spark.sql.adaptive.skewJoin.enabled</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__p14615111220">Specifies whether to automatically handle data skew in join operations. The function is enabled when this parameter is set to <strong id="mrs_01_1987__b15388115577">true</strong> and <strong id="mrs_01_1987__b3384145718">spark.sql.adaptive.enabled</strong> is set to <strong id="mrs_01_1987__b73871175719">true</strong>.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__p1597416394244">true</p>
</td>
</tr>
<tr id="mrs_01_1987__row101471616113511"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__p13147101623511">spark.sql.adaptive.skewJoin.skewedPartitionFactor</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__p34791114215">Multiplier used to determine whether a partition is skewed. A partition is considered a data skew partition if its data size exceeds both the value of this parameter multiplied by the median of all partition sizes except this partition and the value of <strong id="mrs_01_1987__b177329303245754">spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</strong>.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__p11471916163512">5</p>
</td>
</tr>
<tr id="mrs_01_1987__r3a913f8d84974d2ab4f084da83b3b8cc"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__aa0f95b4a8fba4b8894d23196ee267b0c">spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__p38751086293">Partition size threshold, in bytes. A partition is considered skewed if its size is greater than this threshold and also greater than the product of the <strong id="mrs_01_1987__b332795034211">spark.sql.adaptive.skewJoin.skewedPartitionFactor</strong> value and the median partition size. Ideally, the value of this parameter should be greater than that of <strong id="mrs_01_1987__b38020894320">spark.sql.adaptive.advisoryPartitionSizeInBytes</strong>.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__a439f21f1f8a644f095bb057c6534a6e2">256MB</p>
</td>
</tr>
<tr id="mrs_01_1987__rf7fd716e99314bb99cb5db373263a0c8"><td class="cellrowborder" valign="top" width="24.8%" headers="mcps1.3.2.3.2.4.1.1 "><p id="mrs_01_1987__ae2b56c2b191748daa6b580862e0e9127">spark.sql.adaptive.shuffle.targetPostShuffleInputSize</p>
</td>
<td class="cellrowborder" valign="top" width="64.86%" headers="mcps1.3.2.3.2.4.1.2 "><p id="mrs_01_1987__a9108f042b28d403487fe40c6383846b9">Minimum amount of shuffle data processed by each task, in bytes.</p>
</td>
<td class="cellrowborder" valign="top" width="10.34%" headers="mcps1.3.2.3.2.4.1.3 "><p id="mrs_01_1987__a55cbbb1a53b54a48b4e7703d4400bbcc">67108864</p>
</td>
</tr>
</tbody>
</table>
</div>
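<p>The way the factor and the threshold combine can be sketched in a few lines of Python. This is a simplified illustration, not the Spark implementation: the function name <strong>find_skewed_partitions</strong> is hypothetical, the median is taken over all partition sizes, and the 256MB default is read as 256 x 1024 x 1024 bytes.</p>

```python
from statistics import median

MB = 1024 * 1024
# Defaults taken from Table 1 (256MB interpreted as 256 * 1024 * 1024 bytes).
SKEWED_PARTITION_FACTOR = 5
SKEWED_PARTITION_THRESHOLD = 256 * MB


def find_skewed_partitions(sizes,
                           factor=SKEWED_PARTITION_FACTOR,
                           threshold=SKEWED_PARTITION_THRESHOLD):
    """Return the indexes of partitions that meet both skew conditions."""
    if not sizes:
        return []
    # Simplification: the median is computed over all partition sizes.
    med = median(sizes)
    return [i for i, size in enumerate(sizes)
            if size > factor * med and size > threshold]


# Four 64 MB partitions and one 2048 MB partition: the median is 64 MB,
# so the last partition exceeds both 5 * 64 MB and the 256 MB threshold.
sizes = [64 * MB, 64 * MB, 64 * MB, 64 * MB, 2048 * MB]
print(find_skewed_partitions(sizes))  # prints [4]
```

<p>Both conditions must hold. A 100 MB partition among 10 MB partitions is more than five times the median but stays below the threshold, so the sketch does not report it as skewed.</p>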
</div>
</div>