Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00

<a name="mrs_01_24090"></a>
<h1 class="topictitle1">Compaction</h1>
<div id="body0000001150888227"><p id="mrs_01_24090__p19134313327">A compaction merges the base files and log files of merge-on-read (MOR) tables.</p>
<p id="mrs_01_24090__p105421845144513">MOR tables store data in columnar Parquet base files and row-based Avro log files. Updates are first recorded in the incremental log files, and a synchronous or asynchronous compaction later merges them to generate new versions of the columnar files. Because MOR tables are mainly used to reduce data ingestion latency, an asynchronous compaction that does not block ingestion is useful.</p>
<p id="mrs_01_24090__p1586832294815">An asynchronous compaction is performed in the following two steps:</p>
<ol id="mrs_01_24090__ol011982815456"><li id="mrs_01_24090__li21191028194514">Scheduling a compaction: This step is performed by the ingestion job, that is, the job that imports data into the data lake. Hudi scans the partitions, selects the file slices to be compacted, and writes a compaction plan to the Hudi timeline.</li><li id="mrs_01_24090__li111982811458">Executing a compaction: A separate process or thread reads the compaction plan and compacts the selected file slices.</li></ol>
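<p>The two steps above can be pictured with a small toy model. This is only an illustrative sketch in plain Python, not Hudi code: the <strong>FileSlice</strong> and <strong>Timeline</strong> classes and both functions are hypothetical names invented for this example, and the selection rule (pick every slice that has pending log updates) is an assumption. It shows that scheduling only records a plan, while execution is the step that rewrites data.</p>

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """Toy stand-in for a file slice: one base file plus its log updates."""
    partition: str
    base_records: dict                               # key -> value, models the base file
    log_updates: list = field(default_factory=list)  # (key, value) pairs, models log files


@dataclass
class Timeline:
    """Toy stand-in for the Hudi timeline: instant time -> compaction plan."""
    plans: dict = field(default_factory=dict)


def schedule_compaction(slices, timeline, instant_time):
    """Step 1: select slices with pending log updates and record the plan.

    No data is rewritten in this step.
    """
    plan = [i for i, s in enumerate(slices) if s.log_updates]
    timeline.plans[instant_time] = plan
    return plan


def execute_compaction(slices, timeline, instant_time):
    """Step 2: read the recorded plan and merge log updates into each base file."""
    for i in timeline.plans.pop(instant_time):
        s = slices[i]
        for key, value in s.log_updates:  # later updates overwrite earlier values
            s.base_records[key] = value
        s.log_updates = []


slices = [
    FileSlice("dt=2021-06-01", {"a": 1, "b": 2}, [("a", 10), ("c", 3)]),
    FileSlice("dt=2021-06-02", {"x": 7}),  # no log updates, so it is not selected
]
timeline = Timeline()
plan = schedule_compaction(slices, timeline, "20210602101315")
execute_compaction(slices, timeline, "20210602101315")
print(plan, slices[0].base_records)
```

<p>Because the plan is stored separately from the data, the execution step in this model can run in another process or thread and at a later time than the ingestion job that scheduled it.</p>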
<p id="mrs_01_24090__p105483140599">Compaction can be synchronous or asynchronous.</p>
<p id="mrs_01_24090__p1238511794912"><strong id="mrs_01_24090__b144361261443">Synchronous modes</strong></p>
<ul id="mrs_01_24090__ul10193165244619"><li id="mrs_01_24090__li151501225134916">When HoodieDeltaStreamer is used to write upstream data (Kafka/DFS) to a Hudi dataset, the default value of <strong id="mrs_01_24090__b128845827103946">--disable-compaction</strong> is <strong id="mrs_01_24090__b180824881803946">false</strong>, indicating that a compaction is automatically executed.</li><li id="mrs_01_24090__li7968057195215">Using DataSource to specify parameters when writing data<p id="mrs_01_24090__p96471259534"><a name="mrs_01_24090__li7968057195215"></a><a name="li7968057195215"></a><strong id="mrs_01_24090__b180214252229">option("hoodie.compact.inline", "true").</strong></p>
<p id="mrs_01_24090__p09664599522"><strong id="mrs_01_24090__b180372572220">option("hoodie.compact.inline.max.delta.commits", "2").</strong></p>
</li></ul>
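<p>As a minimal sketch, the two DataSource options above can be collected in a small helper. The function name <strong>inline_compaction_options</strong> is hypothetical, and the commented usage line assumes an existing PySpark DataFrame <strong>df</strong> and a table path <strong>base_path</strong>, neither of which is defined in this guide. The second option sets how many delta commits accumulate before an inline compaction is triggered.</p>

```python
def inline_compaction_options(max_delta_commits=2):
    """Return DataSource write options that enable inline (synchronous) compaction.

    Hypothetical helper for illustration; only the two option keys come from
    the guide.
    """
    if max_delta_commits < 1:
        raise ValueError("max_delta_commits must be at least 1")
    return {
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
    }


opts = inline_compaction_options()

# Assumed usage with an existing DataFrame `df` and table path `base_path`
# (the other Hudi write options a real job needs are omitted here):
# df.write.format("hudi").options(**opts).mode("append").save(base_path)
print(opts)
```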
<p id="mrs_01_24090__p345116852217"><strong id="mrs_01_24090__b18866664317">Asynchronous modes</strong></p>
<ul id="mrs_01_24090__ul154727812215"><li id="mrs_01_24090__li2471884227">Using Hudi CLI<p id="mrs_01_24090__p7471683220"><a name="mrs_01_24090__li2471884227"></a><a name="li2471884227"></a>Scheduling a compaction:</p>
<p id="mrs_01_24090__p64718882212"><strong id="mrs_01_24090__b1430943082215">compaction schedule --hoodieConfigs 'hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy,hoodie.compaction.target.io=1,hoodie.compact.inline.max.delta.commits=1'</strong></p>
<p id="mrs_01_24090__p11471787226">Executing a compaction:</p>
<p id="mrs_01_24090__p347138152216"><strong id="mrs_01_24090__b176133472219">compaction run --parallelism 100 --sparkMemory 1g --retry 1 --compactionInstant 20210602101315 --hoodieConfigs 'hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy,hoodie.compaction.target.io=1,hoodie.compact.inline.max.delta.commits=1' --propsFilePath hdfs://hacluster/tmp/default/tb_test_mor/.hoodie/hoodie.properties --schemaFilePath /tmp/default/tb_test_mor/.hoodie/compact_tb_base.json</strong></p>
</li><li id="mrs_01_24090__li5472108202218">Using APIs<p id="mrs_01_24090__p247215819229"><a name="mrs_01_24090__li5472108202218"></a><a name="li5472108202218"></a>Scheduling a compaction:</p>
<p id="mrs_01_24090__p4472118112218"><strong id="mrs_01_24090__b56541656184718">spark-submit --master yarn --jars /opt/client/Hudi/hudi/lib/hudi-client-common-</strong><em id="mrs_01_24090__i18430105724719">xxx</em><strong id="mrs_01_24090__b065217221484">.jar --class org.apache.hudi.utilities.HoodieCompactor /opt/client/Hudi/hudi/lib/hudi-utilities_</strong><em id="mrs_01_24090__i522013233486">xxx</em><strong id="mrs_01_24090__b36521522154817">.jar --base-path /tmp/default/tb_test_mor --table-name tb_test_mor --parallelism 100 --spark-memory 1G --schema-file /tmp/default/tb_test_mor/.hoodie/compact_tb_base.json --instant-time 20210602141810 --schedule --strategy org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy</strong></p>
<p id="mrs_01_24090__p1647278202214">Executing a compaction:</p>
<p id="mrs_01_24090__p247211832214"><strong id="mrs_01_24090__b7407204117485">spark-submit --master yarn --jars /opt/client/Hudi/hudi/lib/hudi-client-common-</strong><em id="mrs_01_24090__i97542204813">xxx</em><strong id="mrs_01_24090__b6483145417484">.jar --class org.apache.hudi.utilities.HoodieCompactor /opt/client/Hudi/hudi/lib/hudi-utilities_</strong><em id="mrs_01_24090__i169335511482">xxx</em><strong id="mrs_01_24090__b15483115418482">.jar --base-path /tmp/default/tb_test_mor --table-name tb_test_mor --parallelism 100 --spark-memory 1G --schema-file /tmp/default/tb_test_mor/.hoodie/compact_tb_base.json --instant-time 20210602141810</strong></p>
<div class="note" id="mrs_01_24090__note1547218813222"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_24090__ul14721812223"><li id="mrs_01_24090__li847219820221">When using Hudi CLI to schedule a compaction, you do not need to specify <strong id="mrs_01_24090__b63522016141717">instant-time</strong>. The system generates it automatically and returns it after the compaction is scheduled successfully. You only need to pass this parameter when executing the compaction.</li><li id="mrs_01_24090__li104724817222">For <strong id="mrs_01_24090__b135203817003946">schema-file</strong>, you need to manually edit the schema file of the current Hudi table and upload it to the server. You can use the schema in the latest <strong id="mrs_01_24090__b84824827503946">.commit</strong> file.</li></ul>
</div></div>
</li></ul>
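<p>The scheduling and execution commands for <strong>HoodieCompactor</strong> differ only in the <strong>--schedule</strong> and <strong>--strategy</strong> arguments. The following sketch makes that difference explicit by building both argument lists from one function. The function name <strong>hoodie_compactor_args</strong> is hypothetical, it only assembles the arguments that follow the utilities JAR in the examples above, and it does not run <strong>spark-submit</strong> itself.</p>

```python
def hoodie_compactor_args(base_path, table_name, schema_file, instant_time,
                          schedule=False, strategy=None,
                          parallelism=100, spark_memory="1G"):
    """Build the HoodieCompactor arguments shown in the spark-submit examples.

    Hypothetical helper: it covers only the flags that appear in this guide.
    """
    args = [
        "--base-path", base_path,
        "--table-name", table_name,
        "--parallelism", str(parallelism),
        "--spark-memory", spark_memory,
        "--schema-file", schema_file,
        "--instant-time", instant_time,
    ]
    if schedule:
        args.append("--schedule")  # write a compaction plan instead of executing one
    if strategy:
        args += ["--strategy", strategy]
    return args


common = dict(
    base_path="/tmp/default/tb_test_mor",
    table_name="tb_test_mor",
    schema_file="/tmp/default/tb_test_mor/.hoodie/compact_tb_base.json",
    instant_time="20210602141810",
)
schedule_args = hoodie_compactor_args(
    schedule=True,
    strategy="org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy",
    **common,
)
run_args = hoodie_compactor_args(**common)

print(" ".join(schedule_args))
print(" ".join(run_args))
```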
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24038.html">Data Management and Maintenance</a></div>
</div>
</div>