Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_2030"></a>
<h1 class="topictitle1">Why Are Some Partitions Empty During Repartition?</h1>
<div id="body1595920221125"><div class="section" id="mrs_01_2030__s6fca241061f5434981f06822030fb6fa"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2030__a7ab679355d9f4360a8c550253ae6581e">During a repartition operation, the number of shuffle partitions (<span class="parmname" id="mrs_01_2030__parmname26141708393529"><b>spark.sql.shuffle.partitions</b></span>) is set to 4,500, and the repartition uses more than 4,000 distinct keys. Data for different keys is expected to be distributed to different partitions. However, only 2,000 partitions contain data, and data for different keys ends up in the same partition.</p>
</div>
<div class="section" id="mrs_01_2030__s69d8a52b22254013880517ed653e61b1"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2030__a1e3a738a4ba347489a07ecf031fea753">This is normal.</p>
<p id="mrs_01_2030__aa94112854a8640c5b2ba3bc2aba377f1">The partition that a record is distributed to is obtained by taking the hash code of its key modulo the number of partitions. Different hash codes can produce the same modulo result, in which case the corresponding data is distributed to the same partition. As a result, some partitions contain no data, while others contain data for multiple keys.</p>
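<p>The effect described above can be reproduced with a small standalone sketch. It is a toy model, not Spark's implementation: the <b>partition_for</b> and <b>occupancy</b> helpers, the <b>key-N</b> key names, and the use of CRC32 as the hash function are all assumptions made for illustration.</p>

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Toy partitioner: non-negative hash code modulo the partition count."""
    hash_code = zlib.crc32(key.encode("utf-8"))  # stand-in hash, NOT Spark's hash
    return hash_code % num_partitions


def occupancy(keys, num_partitions):
    """Return (non-empty partitions, empty partitions, max keys in one partition)."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    occupied = len(counts)
    return occupied, num_partitions - occupied, max(counts.values())


# Hypothetical workload: 4,000 distinct keys spread over 4,500 partitions.
keys = [f"key-{i}" for i in range(4000)]
occupied, empty, busiest = occupancy(keys, 4500)
print(f"non-empty: {occupied}, empty: {empty}, most keys in one partition: {busiest}")
```

<p>With fewer keys than partitions, at least 500 partitions must stay empty in this model. Every pair of keys whose hash codes give the same modulo result leaves one more partition empty.</p>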
<p id="mrs_01_2030__abde1ab966f2b412a82360afec4c93491">You can change the value of <span class="parmname" id="mrs_01_2030__p5718ff97c51349d69b46f9426202d0b9"><b>spark.sql.shuffle.partitions</b></span> to change the modulus used in the modulo operation and reduce the unevenness of data distribution across partitions. Repeated tests show that setting the parameter to a prime number or an odd number works well.</p>
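<p>One way to see why a prime or odd value can help is to look at hash codes that share a common factor with the partition count. The sketch below assumes hypothetical hash codes with a regular stride of 100. This is an illustrative assumption only, and how strongly the effect appears in a real job depends on the actual keys and on the hash function in use. The <b>occupied_partitions</b> helper is likewise hypothetical.</p>

```python
from math import gcd


def occupied_partitions(hash_codes, num_partitions):
    """Count the distinct partitions hit by hash_code % num_partitions."""
    return len({h % num_partitions for h in hash_codes})


# Hypothetical hash codes with a regular stride of 100 (assumption for illustration).
hash_codes = [100 * i for i in range(4000)]

# 4500 shares the factor 100 with the stride; 4501 is odd; 4507 is prime.
for num_partitions in (4500, 4501, 4507):
    used = occupied_partitions(hash_codes, num_partitions)
    print(f"partitions={num_partitions} gcd={gcd(100, num_partitions)} non-empty={used}")
```

<p>In this model, 4500 partitions leave only 45 non-empty because the stride and the partition count share the factor 100. With 4501 or 4507 the shared factor is 1 and all 4,000 hash codes land in different partitions.</p>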
<p id="mrs_01_2030__a3036d0345d9b4ac49e7f822a23eea31a">Configure the following parameter in the <strong id="mrs_01_2030__b210707998493529">spark-defaults.conf</strong> file on the Driver client.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_2030__ta169487b849d46bc8e9383caf36c228b" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter Description</caption><thead align="left"><tr id="mrs_01_2030__r33329d34f3224d768c12a309b29f8fc5"><th align="left" class="cellrowborder" valign="top" width="33.589999999999996%" id="mcps1.3.2.6.2.4.1.1"><p id="mrs_01_2030__a85f334c2630d4ab493974f4291ba8e1e">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="37.9%" id="mcps1.3.2.6.2.4.1.2"><p id="mrs_01_2030__ae4ef405eae994542b3908055be8d55f2">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="28.51%" id="mcps1.3.2.6.2.4.1.3"><p id="mrs_01_2030__a8318828b751c43bcb2fabc39a6302cb5">Default Value</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_2030__rf9136fffabd747d882b5e6c00a29512f"><td class="cellrowborder" valign="top" width="33.589999999999996%" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2030__a63bddec3a424424c95648237ccab07dd">spark.sql.shuffle.partitions</p>
</td>
<td class="cellrowborder" valign="top" width="37.9%" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2030__ab0c29d3ec56b4498a542500a08ed92a0">Number of partitions (data blocks) that a shuffle operation produces.</p>
</td>
<td class="cellrowborder" valign="top" width="28.51%" headers="mcps1.3.2.6.2.4.1.3 "><p id="mrs_01_2030__a68066d157b7f429a9f6e49eb22c1821c">200</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2022.html">Spark SQL and DataFrame</a></div>
</div>
</div>