Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00


<a name="mrs_01_2031"></a><a name="mrs_01_2031"></a>
<h1 class="topictitle1">Why Does 16 Terabytes of Text Data Fail to Be Converted into 4 Terabytes of Parquet Data?</h1>
<div id="body1595920221127"><div class="section" id="mrs_01_2031__s100984b7ce93405dbe1993c70bf348ed"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2031__a8a5d0c0812bc4f5cb0bf570f3da41f0d">With the default configuration, 16 terabytes of text data fails to be converted into 4 terabytes of Parquet data, and the following error is reported. Why?</p>
<pre class="screen" id="mrs_01_2031__sdf0849c90a184d8b93a4358435258341">Job aborted due to stage failure: Task 2866 in stage 11.0 failed 4 times, most recent failure: Lost task 2866.6 in stage 11.0 (TID 54863, linux-161, 2): java.io.IOException: Failed to connect to /10.16.1.11:23124
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:92)</pre>
<p id="mrs_01_2031__aa6b361cc649347ccaf73cdf041796fcd"><a href="#mrs_01_2031__t11c6ce9d45ea4d69a81893e1f90f35cc">Table 1</a> lists the default configuration.</p>
<div class="tablenoborder"><a name="mrs_01_2031__t11c6ce9d45ea4d69a81893e1f90f35cc"></a><a name="t11c6ce9d45ea4d69a81893e1f90f35cc"></a><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_2031__t11c6ce9d45ea4d69a81893e1f90f35cc" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter Description</caption><thead align="left"><tr id="mrs_01_2031__r495821a1e8bf4cb7b8a2e95e4b20118d"><th align="left" class="cellrowborder" valign="top" width="31.4%" id="mcps1.3.1.5.2.4.1.1"><p id="mrs_01_2031__a27fc00d6625e400082959e665f120b7e">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="48.86%" id="mcps1.3.1.5.2.4.1.2"><p id="mrs_01_2031__ab11d95672135438dbd518a013a2fe7fa">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="19.74%" id="mcps1.3.1.5.2.4.1.3"><p id="mrs_01_2031__a8d2f21b49c9343b0b65a3ffaeed31a29">Default Value</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_2031__rd1d7d5145aa3442088536b37319273b4"><td class="cellrowborder" valign="top" width="31.4%" headers="mcps1.3.1.5.2.4.1.1 "><p id="mrs_01_2031__a534f31173eb647aba170de81842069ec">spark.sql.shuffle.partitions</p>
</td>
<td class="cellrowborder" valign="top" width="48.86%" headers="mcps1.3.1.5.2.4.1.2 "><p id="mrs_01_2031__a0a19682fa8c445a4b318fe16706ff371">Number of partitions used when data is shuffled for joins or aggregations.</p>
</td>
<td class="cellrowborder" valign="top" width="19.74%" headers="mcps1.3.1.5.2.4.1.3 "><p id="mrs_01_2031__a2dd117726fee4405b8977f2bf177948f">200</p>
</td>
</tr>
<tr id="mrs_01_2031__r762e1e8149944c7bbd062f33cc3ad324"><td class="cellrowborder" valign="top" width="31.4%" headers="mcps1.3.1.5.2.4.1.1 "><p id="mrs_01_2031__a7db34485dda6438cbf6a890423881baf">spark.shuffle.sasl.timeout</p>
</td>
<td class="cellrowborder" valign="top" width="48.86%" headers="mcps1.3.1.5.2.4.1.2 "><p id="mrs_01_2031__ad70534097e834c7ba27541af9a0d28a0">Timeout interval of SASL authentication for the shuffle operation. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.74%" headers="mcps1.3.1.5.2.4.1.3 "><p id="mrs_01_2031__aea98210aad894fc1921474fe7181ad08">120s</p>
</td>
</tr>
<tr id="mrs_01_2031__r93f327468e3e4e7986cb5f404a815f55"><td class="cellrowborder" valign="top" width="31.4%" headers="mcps1.3.1.5.2.4.1.1 "><p id="mrs_01_2031__a9bd95bdeba764a7eb169f9b89223f969">spark.shuffle.io.connectionTimeout</p>
</td>
<td class="cellrowborder" valign="top" width="48.86%" headers="mcps1.3.1.5.2.4.1.2 "><p id="mrs_01_2031__a845725023b164c8c965e5b8c4a2f7126">Timeout interval for connecting to a remote node during the shuffle operation. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.74%" headers="mcps1.3.1.5.2.4.1.3 "><p id="mrs_01_2031__a1dc4927677aa485c94f835f501b25f6d">120s</p>
</td>
</tr>
<tr id="mrs_01_2031__r8139e5d658a142c7a650bc55ad55b116"><td class="cellrowborder" valign="top" width="31.4%" headers="mcps1.3.1.5.2.4.1.1 "><p id="mrs_01_2031__aa32d8a76e8b7408ebb2c2ace7cd2d56c">spark.network.timeout</p>
</td>
<td class="cellrowborder" valign="top" width="48.86%" headers="mcps1.3.1.5.2.4.1.2 "><p id="mrs_01_2031__a14870bb21aa340a68f8a327d58c32ae3">Timeout interval for all network connection operations. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.74%" headers="mcps1.3.1.5.2.4.1.3 "><p id="mrs_01_2031__a7b114de4b7ac4513b6afdda3be8358ac">360s</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="mrs_01_2031__sa20d9b7920074ce4b3415205285205c4"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2031__a8b6bae27aff343e0af81e5b364dffa4b">The data volume is 16 TB, but the default number of shuffle partitions is only 200. As a result, each task must process too much data, shuffle connections to remote nodes time out, and the job fails with the preceding error.</p>
<p id="mrs_01_2031__a8df501a5f8ef4a44b8dfbe9a08cefc14">To resolve the problem, adjust the parameters as follows:</p>
<ul id="mrs_01_2031__u79627f6b11884c05aea91f603fe8d170"><li id="mrs_01_2031__l39da046a92d441e796656b334788d93c">Increase the number of partitions to divide the task into smaller ones.</li><li id="mrs_01_2031__ld6652072c621475a8f48307ef5a7c514">Increase the timeout interval during task execution.</li></ul>
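<p>As a rough illustration of the first point, the per-partition volume can be estimated as follows. The estimate assumes that the data is evenly distributed and that the shuffle volume is comparable to the 16 TB input. Both are simplifying assumptions, not measurements. The value 4501 is the partition count recommended in Table 2.</p>
<pre class="screen">16 TB / 200 partitions  = 16,384 GB / 200  ≈ 82 GB per partition
16 TB / 4501 partitions = 16,384 GB / 4501 ≈ 3.6 GB per partition</pre>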
</div>
<p id="mrs_01_2031__a42e941072fb14c65845529e65758b87a">Configure the following parameters in the <span class="filepath" id="mrs_01_2031__filepath9592136202613"><b>spark-defaults.conf</b></span> file on the client:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_2031__tb2a0df6c550745648f13104eb6e0eef0" frame="border" border="1" rules="all"><caption><b>Table 2 </b>Parameter Description</caption><thead align="left"><tr id="mrs_01_2031__rfc083680c7244d5bb2e6ffac0c149a44"><th align="left" class="cellrowborder" valign="top" width="39.756024397560246%" id="mcps1.3.4.2.4.1.1"><p id="mrs_01_2031__affce9dc62cf44d429ca61ee1d3200bbb">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="40.29597040295971%" id="mcps1.3.4.2.4.1.2"><p id="mrs_01_2031__a3a9666e0e147433c916c71e1afe15b89">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="19.948005199480054%" id="mcps1.3.4.2.4.1.3"><p id="mrs_01_2031__ad3b01c2056df40d3a4a103068da095db">Recommended Value</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_2031__r1260657ff3534fc49971bb4836ad1c6f"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.4.2.4.1.1 "><p id="mrs_01_2031__aaff2cbcc9b1d4a46909750c8335535aa">spark.sql.shuffle.partitions</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.4.2.4.1.2 "><p id="mrs_01_2031__aaa18f3739e5b437e86c1a1aacac59d25">Number of partitions used when data is shuffled for joins or aggregations.</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.4.2.4.1.3 "><p id="mrs_01_2031__a3148fc51bf444e33816445403e23403d">4501</p>
</td>
</tr>
<tr id="mrs_01_2031__r11cb34476df24943a737147ee485c692"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.4.2.4.1.1 "><p id="mrs_01_2031__a70ebe2fe188f4790b3a2836bfe53a81c">spark.shuffle.sasl.timeout</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.4.2.4.1.2 "><p id="mrs_01_2031__ac71b68e3e9f04c5f91b402a2acc086e6">Timeout interval of SASL authentication for the shuffle operation. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.4.2.4.1.3 "><p id="mrs_01_2031__a72345ba79f2a4a59862beb0d45da10cc">2000s</p>
</td>
</tr>
<tr id="mrs_01_2031__re2939edc470e43ef81d4bf735741434a"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.4.2.4.1.1 "><p id="mrs_01_2031__aa90825928c864e5fa4aca36b3eeaf1b5"><span id="mrs_01_2031__p2217717fee914b26bafc24ded2087e9a">spark.shuffle.io.connectionTimeout</span></p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.4.2.4.1.2 "><p id="mrs_01_2031__a8bb0bb3d716c41e385a9b38062aa60f1">Timeout interval for connecting to a remote node during the shuffle operation. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.4.2.4.1.3 "><p id="mrs_01_2031__accc857c2957d41368e67008d82bfee01">3000s</p>
</td>
</tr>
<tr id="mrs_01_2031__r57f35c358a1b43c194ecd2477d9160e5"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.4.2.4.1.1 "><p id="mrs_01_2031__ae1c570265b7e412686a47c916e9a7167"><span id="mrs_01_2031__pa3527ac3e7ff4a059a909fae44d27cf5">spark.network.timeout</span></p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.4.2.4.1.2 "><p id="mrs_01_2031__a4f7115d85534487390dc6c1086f3a41f">Timeout interval for all network connection operations. Unit: second</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.4.2.4.1.3 "><p id="mrs_01_2031__a3459b35dc7aa451b9782adb849674d8b">360s</p>
</td>
</tr>
</tbody>
</table>
</div>
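<p>A minimal sketch of the corresponding entries in <b>spark-defaults.conf</b>, using the recommended values from Table 2. Each entry is a property name followed by whitespace and its value. Treat the values as a starting point for this 16 TB scenario rather than as universal settings.</p>
<pre class="screen"># Values taken from Table 2; tune them for your own data volume and cluster.
spark.sql.shuffle.partitions         4501
spark.shuffle.sasl.timeout           2000s
spark.shuffle.io.connectionTimeout   3000s
spark.network.timeout                360s</pre>
<p>The same properties can also be supplied for a single job through the <b>--conf</b> option of <b>spark-submit</b>, for example <b>--conf spark.sql.shuffle.partitions=4501</b>.</p>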
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2022.html">Spark SQL and DataFrame</a></div>
</div>
</div>