Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00


<a name="mrs_01_24035"></a><a name="mrs_01_24035"></a>
<h1 class="topictitle1">Batch Write</h1>
<div id="body0000001081783758"><div class="section" id="mrs_01_24035__section027962805014"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_24035__p8060118">Hudi provides multiple write modes. For details, see the configuration item <strong id="mrs_01_24035__b76921557419">hoodie.datasource.write.operation</strong>. This section describes <strong id="mrs_01_24035__b176981159414">upsert</strong>, <strong id="mrs_01_24035__b1669845147">insert</strong>, and <strong id="mrs_01_24035__b169985243">bulk_insert</strong>.</p>
<ul id="mrs_01_24035__ul9734944113114"><li id="mrs_01_24035__li165961848113111"><strong id="mrs_01_24035__b53531517220">insert</strong>: The process is similar to <strong id="mrs_01_24035__b14354451427">upsert</strong>, except that Hudi does not use indexes to look up the file partitions to be updated. Therefore, <strong id="mrs_01_24035__b20354155111218">insert</strong> is faster than <strong id="mrs_01_24035__b835412511128">upsert</strong>. This operation is recommended for data sources that do not contain updated data. If the data source contains updated data, duplicate records will exist in the data lake.</li><li id="mrs_01_24035__li196117525311"><strong id="mrs_01_24035__b134592216410">bulk_insert</strong> (insert in batches): It is used for initial dataset loading. This operation sorts records by primary key and then writes them to the Hudi table in the same way as data is written to an ordinary Parquet table. It has the best performance but cannot control small files. In contrast, the <strong id="mrs_01_24035__b14609211546">upsert</strong> and <strong id="mrs_01_24035__b146172115412">insert</strong> operations use heuristics to control small files.</li><li id="mrs_01_24035__li3742101711107"><strong id="mrs_01_24035__b218112618410">upsert</strong> (insert and update): It is the default operation type. Hudi uses the primary key to determine whether a record already exists. Existing records are updated, and all other records are inserted. This operation is recommended for data sources that include updated data, such as change data capture (CDC).</li></ul>
<div class="note" id="mrs_01_24035__note1494105851616"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_24035__ul3798195015318"><li id="mrs_01_24035__li14798125019313">Primary keys are not sorted during <strong id="mrs_01_24035__b1740519919409">insert</strong>. Therefore, <strong id="mrs_01_24035__b13405199134011">insert</strong> is not recommended for dataset initialization.</li><li id="mrs_01_24035__li6799350131">Use <strong id="mrs_01_24035__b17541020194014">insert</strong> if all data is new, <strong id="mrs_01_24035__b11603202400">upsert</strong> if existing data needs to be updated, and <strong id="mrs_01_24035__b206172014014">bulk_insert</strong> if a dataset needs to be initialized.</li></ul>
</div></div>
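<p>The recommendations above can be sketched as a small helper. This is an illustration only: <strong>chooseWriteOperation</strong> and its two flags are hypothetical names, not part of Hudi. The returned strings are the values of <strong>hoodie.datasource.write.operation</strong> described in this section.</p>
<pre class="screen">// Hypothetical helper (not a Hudi API): maps a load scenario to a value of
// hoodie.datasource.write.operation.
def chooseWriteOperation(initialLoad: Boolean, hasUpdates: Boolean): String = {
  if (initialLoad) "bulk_insert"   // initial dataset loading
  else if (hasUpdates) "upsert"    // the source contains updates, for example CDC
  else "insert"                    // the source contains only new records
}

val operation = chooseWriteOperation(initialLoad = false, hasUpdates = true)
// Pass the result to the writer:
// option("hoodie.datasource.write.operation", operation)</pre>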
</div>
<div class="section" id="mrs_01_24035__section749043265111"><h4 class="sectiontitle">Writing Data to Hudi Tables in Batches</h4><ol id="mrs_01_24035__ol16216638171512"><li id="mrs_01_24035__li192161038141516">Import the Hudi package and generate test data. For details, see <a href="mrs_01_24033.html#mrs_01_24033__li6424125918379">2</a> to <a href="mrs_01_24033.html#mrs_01_24033__li654313073616">4</a> in <a href="mrs_01_24033.html">Getting Started</a>.</li><li id="mrs_01_24035__li1953116488175">Add the <strong id="mrs_01_24035__b04611113919">option("hoodie.datasource.write.operation", "bulk_insert")</strong> parameter to the command for writing data to a Hudi table to set the write mode to bulk_insert. For example:<pre class="screen" id="mrs_01_24035__screen17490349195">df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.datasource.write.recordkey.field", "uuid").
option("hoodie.datasource.write.partitionpath.field", "").
<strong id="mrs_01_24035__b10519027173013">option("hoodie.datasource.write.operation", "bulk_insert")</strong>.
option("hoodie.table.name", tableName).
option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
option("hoodie.datasource.hive_sync.enable", "true").
option("hoodie.datasource.hive_sync.partition_fields", "").
option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
option("hoodie.datasource.hive_sync.table", tableName).
option("hoodie.datasource.hive_sync.use_jdbc", "false").
option("hoodie.bulkinsert.shuffle.parallelism", 4).
mode(Overwrite).
save(basePath)</pre>
<div class="note" id="mrs_01_24035__note414974584913"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_24035__ul1469264874914"><li id="mrs_01_24035__li2813185119204">For details about the parameters in the example, see <a href="mrs_01_24093.html#mrs_01_24093__table1815615307121">Table 1</a>.</li><li id="mrs_01_24035__li869274894910">When the Spark DataSource API is used to update a MOR table, inserting a small volume of data may cause small files that contain updated data to be merged. As a result, some updated data is visible in the read-optimized view of the MOR table.</li><li id="mrs_01_24035__li9692174814491">If the base file of the data to be updated is a small file, the inserted data and the updated data are merged with the base file to generate a new base file instead of being written to logs.</li></ul>
</div></div>
</li></ol>
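<p>After the initial load, later batches that contain updates are typically written with <strong>upsert</strong>. The following is a minimal sketch under stated assumptions: <strong>upsertOptions</strong> is a hypothetical helper name, the field names <strong>ts</strong> and <strong>uuid</strong> are taken from the preceding example, and the <strong>Append</strong> save mode in the commented usage is assumed so that existing table data is kept.</p>
<pre class="screen">// Hypothetical helper: collects the writer options for an upsert into the
// non-partitioned table from the preceding example. Only the operation differs.
def upsertOptions(tableName: String): Map[String, String] = Map(
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.datasource.write.recordkey.field" -> "uuid",
  "hoodie.datasource.write.partitionpath.field" -> "",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
)

// Assumed usage in spark-shell, with df, tableName, and basePath defined as in
// the preceding steps:
// df.write.format("org.apache.hudi").
//   options(getQuickstartWriteConfigs).
//   options(upsertOptions(tableName)).
//   mode(Append).
//   save(basePath)</pre>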
</div>
<div class="section" id="mrs_01_24035__section156299412448"><h4 class="sectiontitle">Configuring Partitions</h4><p id="mrs_01_24035__p181081744414">Hudi supports multiple partitioning modes, such as multi-level partitioning, non-partitioning, single-level partitioning, and partitioning by date. You can select a proper partitioning mode as required. The following describes how to configure different partitioning modes for Hudi.</p>
</div>
<ul id="mrs_01_24035__ul91969562562"><li id="mrs_01_24035__li171967560566">Multi-level partitioning<p id="mrs_01_24035__p82191118165614"><a name="mrs_01_24035__li171967560566"></a><a name="li171967560566"></a>Multi-level partitioning indicates that multiple fields are specified as partition keys. Pay attention to the following configuration items:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_24035__table12536135911448" frame="border" border="1" rules="all"><thead align="left"><tr id="mrs_01_24035__row165361159124416"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.1.2.1.3.1.1"><p id="mrs_01_24035__p18536659114412">Configuration Item</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.1.2.1.3.1.2"><p id="mrs_01_24035__p153775924416">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_24035__row0537185944414"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.1 "><p id="mrs_01_24035__p17537105914412">hoodie.datasource.write.partitionpath.field</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.2 "><p id="mrs_01_24035__p75375594442">Specify multiple partition fields, separated by commas. For example, to partition by <strong id="mrs_01_24035__b19896903939458">p1</strong>, <strong id="mrs_01_24035__b3493854019458">p2</strong>, and <strong id="mrs_01_24035__b21218018549458">p3</strong>, set this parameter to p1,p2,p3.</p>
</td>
</tr>
<tr id="mrs_01_24035__row15839108184516"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.1 "><p id="mrs_01_24035__p58401989459">hoodie.datasource.hive_sync.partition_fields</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.2 "><p id="mrs_01_24035__p14269131895712">Set this parameter to <strong id="mrs_01_24035__b12382830868">p1</strong>, <strong id="mrs_01_24035__b7383133017611">p2</strong>, and <strong id="mrs_01_24035__b1038413016613">p3</strong>. The values must be the same as the partition fields of <strong id="mrs_01_24035__b132360291279">hoodie.datasource.write.partitionpath.field</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row653715596445"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.1 "><p id="mrs_01_24035__p14537759194418">hoodie.datasource.write.keygenerator.class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.2 "><p id="mrs_01_24035__p1053735914411">Set this parameter to <strong id="mrs_01_24035__b12733157777">org.apache.hudi.keygen.ComplexKeyGenerator</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row14537859164414"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.1 "><p id="mrs_01_24035__p15537759124411">hoodie.datasource.hive_sync.partition_extractor_class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.1.2.1.3.1.2 "><p id="mrs_01_24035__p853785911441">Set this parameter to <strong id="mrs_01_24035__b45804519817">org.apache.hudi.hive.MultiPartKeysValueExtractor</strong>.</p>
</td>
</tr>
</tbody>
</table>
</div>
</li></ul>
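<p>The multi-level partitioning settings can be collected in one place so that the write and Hive synchronization fields cannot drift apart. This is a sketch, not a definitive implementation: <strong>multiLevelPartitionOptions</strong> is a hypothetical helper, <strong>p1</strong>, <strong>p2</strong>, and <strong>p3</strong> are example column names, and the comma-separated value format is an assumption about how multiple fields are passed.</p>
<pre class="screen">// Hypothetical helper: partitionFields are example column names such as p1, p2, p3.
def multiLevelPartitionOptions(partitionFields: Seq[String]): Map[String, String] = {
  // Assumption: multiple fields are passed as one comma-separated string.
  val fields = partitionFields.mkString(",")
  Map(
    "hoodie.datasource.write.partitionpath.field" -> fields,
    // Must be the same as the partition fields used for writing.
    "hoodie.datasource.hive_sync.partition_fields" -> fields,
    "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  )
}

val multiLevelOptions = multiLevelPartitionOptions(Seq("p1", "p2", "p3"))
// Assumed usage: df.write.format("org.apache.hudi").options(multiLevelOptions)...</pre>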
<ul id="mrs_01_24035__ul178191845937"><li id="mrs_01_24035__li8819144517317">Non-partitioning<p id="mrs_01_24035__p0755011214"><a name="mrs_01_24035__li8819144517317"></a><a name="li8819144517317"></a>Hudi supports non-partitioned tables. Pay attention to the following configuration items:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_24035__table2080593016473" frame="border" border="1" rules="all"><thead align="left"><tr id="mrs_01_24035__row4807530124716"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.5.1.2.1.3.1.1"><p id="mrs_01_24035__p20807193018471">Configuration Item</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.5.1.2.1.3.1.2"><p id="mrs_01_24035__p18807130124710">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_24035__row108076306474"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.1 "><p id="mrs_01_24035__p19807123034718">hoodie.datasource.write.partitionpath.field</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.2 "><p id="mrs_01_24035__p18078303478">Leave this parameter blank.</p>
</td>
</tr>
<tr id="mrs_01_24035__row208071930144718"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.1 "><p id="mrs_01_24035__p58075302479">hoodie.datasource.hive_sync.partition_fields</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.2 "><p id="mrs_01_24035__p6807530164714">Leave this parameter blank.</p>
</td>
</tr>
<tr id="mrs_01_24035__row68071430184718"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.1 "><p id="mrs_01_24035__p1680713054716">hoodie.datasource.write.keygenerator.class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.2 "><p id="mrs_01_24035__p4807930134713">Set this parameter to <strong id="mrs_01_24035__b57421793917">org.apache.hudi.keygen.NonpartitionedKeyGenerator</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row108077308476"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.1 "><p id="mrs_01_24035__p280753004710">hoodie.datasource.hive_sync.partition_extractor_class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.5.1.2.1.3.1.2 "><p id="mrs_01_24035__p3807143004720">Set this parameter to <strong id="mrs_01_24035__b1087010271196">org.apache.hudi.hive.NonPartitionedExtractor</strong>.</p>
</td>
</tr>
</tbody>
</table>
</div>
</li></ul>
<ul id="mrs_01_24035__ul1461625111612"><li id="mrs_01_24035__li561719518612">Single-level partitioning<p id="mrs_01_24035__p95233182043"><a name="mrs_01_24035__li561719518612"></a><a name="li561719518612"></a>It is similar to multi-level partitioning. Pay attention to the following configuration items:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_24035__table159054496487" frame="border" border="1" rules="all"><thead align="left"><tr id="mrs_01_24035__row10906194914485"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.6.1.2.1.3.1.1"><p id="mrs_01_24035__p5906154918483">Configuration Item</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.6.1.2.1.3.1.2"><p id="mrs_01_24035__p119061491481">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_24035__row590644964817"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.1 "><p id="mrs_01_24035__p139067493485">hoodie.datasource.write.partitionpath.field</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.2 "><p id="mrs_01_24035__p5906194924813">Set this parameter to one field, for example, <strong id="mrs_01_24035__b23882671019">p</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row179060494485"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.1 "><p id="mrs_01_24035__p132351551134910">hoodie.datasource.hive_sync.partition_fields</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.2 "><p id="mrs_01_24035__p203931147195616">Set this parameter to <strong id="mrs_01_24035__b09845187100">p</strong>.</p>
<p id="mrs_01_24035__p193921947125612">The value must be the same as the partition field of <strong id="mrs_01_24035__b361585484910">hoodie.datasource.write.partitionpath.field</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row29061249194812"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.1 "><p id="mrs_01_24035__p69061949194810">hoodie.datasource.write.keygenerator.class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.2 "><p id="mrs_01_24035__p15906144913487">(Optional) The default value is <strong id="mrs_01_24035__b9799261279458">org.apache.hudi.keygen.SimpleKeyGenerator</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row14906134917489"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.1 "><p id="mrs_01_24035__p690613492484">hoodie.datasource.hive_sync.partition_extractor_class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.6.1.2.1.3.1.2 "><p id="mrs_01_24035__p18906164924817">Set this parameter to <strong id="mrs_01_24035__b161137113123">org.apache.hudi.hive.MultiPartKeysValueExtractor</strong>.</p>
</td>
</tr>
</tbody>
</table>
</div>
</li></ul>
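<p>For single-level partitioning, the key generator can be left unset so that the default listed in the table applies. A minimal sketch, assuming a hypothetical helper named <strong>singleLevelPartitionOptions</strong> and an example partition column <strong>p</strong>:</p>
<pre class="screen">// Hypothetical helper: partitionField is an example column name such as p.
def singleLevelPartitionOptions(partitionField: String): Map[String, String] = Map(
  "hoodie.datasource.write.partitionpath.field" -> partitionField,
  // Must be the same as the partition field used for writing.
  "hoodie.datasource.hive_sync.partition_fields" -> partitionField,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
  // hoodie.datasource.write.keygenerator.class is omitted, so the default applies.
)

val singleLevelOptions = singleLevelPartitionOptions("p")</pre>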
<ul id="mrs_01_24035__ul89340011110"><li id="mrs_01_24035__li4935150171112">Partitioning by date<p id="mrs_01_24035__p843713716710"><a name="mrs_01_24035__li4935150171112"></a><a name="li4935150171112"></a>The <strong id="mrs_01_24035__b41742994511">date</strong> field is specified as the partition field. Pay attention to the following configuration items:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_24035__table188991948175317" frame="border" border="1" rules="all"><thead align="left"><tr id="mrs_01_24035__row16899194835313"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.7.1.2.1.3.1.1"><p id="mrs_01_24035__p19899748125313">Configuration Item</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.7.1.2.1.3.1.2"><p id="mrs_01_24035__p15899194885315">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_24035__row13899164815313"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.1 "><p id="mrs_01_24035__p589914489532">hoodie.datasource.write.partitionpath.field</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.2 "><p id="mrs_01_24035__p12899164813535">Set this parameter to the <strong id="mrs_01_24035__b5176122015121">date</strong> field, for example, <strong id="mrs_01_24035__b91762205122">operationTime</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row1089954812534"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.1 "><p id="mrs_01_24035__p989984818532">hoodie.datasource.hive_sync.partition_fields</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.2 "><p id="mrs_01_24035__p389918489535">Set this parameter to <strong id="mrs_01_24035__b12781214099458">operationTime</strong>. The value must be the same as the preceding partition field.</p>
</td>
</tr>
<tr id="mrs_01_24035__row48991648145319"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.1 "><p id="mrs_01_24035__p789994855310">hoodie.datasource.write.keygenerator.class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.2 "><p id="mrs_01_24035__p589934818534">(Optional) The default value is <strong id="mrs_01_24035__b25414819137">org.apache.hudi.keygen.SimpleKeyGenerator</strong>.</p>
</td>
</tr>
<tr id="mrs_01_24035__row10899148145311"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.1 "><p id="mrs_01_24035__p128993486536">hoodie.datasource.hive_sync.partition_extractor_class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.1.2.1.3.1.2 "><p id="mrs_01_24035__p1189912487530">Set this parameter to <strong id="mrs_01_24035__b61623611311">org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor</strong>.</p>
</td>
</tr>
</tbody>
</table>
</div>
<div class="note" id="mrs_01_24035__note12281558196"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24035__p58341058181917">The date format for <strong id="mrs_01_24035__b1690891115214">SlashEncodedDayPartitionValueExtractor</strong> must be <em id="mrs_01_24035__i21501817175117">yyyy/MM/dd</em>.</p>
</div></div>
</li><li id="mrs_01_24035__li11456539125715">Partition sorting
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_24035__table16367734145513" frame="border" border="1" rules="all"><thead align="left"><tr id="mrs_01_24035__row1436714341558"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.7.2.1.1.3.1.1"><p id="mrs_01_24035__p5367113435510">Configuration Item</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.7.2.1.1.3.1.2"><p id="mrs_01_24035__p8367134185515">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_24035__row12367934145513"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.2.1.1.3.1.1 "><p id="mrs_01_24035__p18368134175519">hoodie.bulkinsert.user.defined.partitioner.class</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.7.2.1.1.3.1.2 "><p id="mrs_01_24035__p1368163419551">Specifies the partition sorting class. You can customize a sorting method. For details, see the sample code.</p>
</td>
</tr>
</tbody>
</table>
</div>
<div class="note" id="mrs_01_24035__note10170618192018"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24035__p1283113294310">By default, <strong id="mrs_01_24035__b1914518235427">bulk_insert</strong> sorts records character by character, so the default sorting is suitable only for primary keys of StringType.</p>
</div></div>
</li></ul>
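<p>For partitioning by date, the partition value has to be a slash-separated year/month/day path, such as 2022/12/09. The following sketch shows one way to produce such values and the matching options. The names <strong>dayPathFormat</strong>, <strong>toDayPartitionValue</strong>, and <strong>datePartitionOptions</strong> are hypothetical, <strong>operationTime</strong> is the example field from the table, and the use of <strong>java.time</strong> is an assumption about how the source data is prepared.</p>
<pre class="screen">import java.time.LocalDate
import java.time.format.DateTimeFormatter

// java.time pattern letters: yyyy = year, MM = month, dd = day of month.
val dayPathFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

// Hypothetical helper: formats a date as a slash-separated day path.
def toDayPartitionValue(date: LocalDate): String = date.format(dayPathFormat)

val datePartitionOptions: Map[String, String] = Map(
  "hoodie.datasource.write.partitionpath.field" -> "operationTime",
  // Must be the same as the partition field used for writing.
  "hoodie.datasource.hive_sync.partition_fields" -> "operationTime",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor"
)

val examplePartitionValue = toDayPartitionValue(LocalDate.of(2022, 12, 9))</pre>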
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24034.html">Write</a></div>
</div>
</div>