Compaction merges the base files and log files of merge-on-read (MOR) tables.
In a MOR table, data is stored in columnar Parquet base files and row-based Avro log files. Updates are first recorded in the incremental log files, and a synchronous or asynchronous compaction then merges them to generate new versions of the columnar files. MOR tables are used to reduce data ingestion latency, so an asynchronous compaction, which does not block ingestion, is particularly useful.
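The merge described above can be pictured with a toy model. This is only an illustrative sketch, not Hudi's implementation: plain Python dictionaries stand in for the Parquet base file and the Avro log file, and the function name `compact` and the sample records are hypothetical.

```python
# Toy model only: a dict stands in for a columnar base file and a list of
# (key, row) pairs stands in for a row-based log file. Names are hypothetical.

def compact(base_records, log_records):
    """Merge logged updates into the base records and return a new base version.

    base_records: dict mapping record key -> row (the current base file)
    log_records:  list of (key, row) updates in commit order (the log file)
    """
    merged = dict(base_records)      # copy, so the old base version is untouched
    for key, row in log_records:     # replay updates from oldest to newest
        merged[key] = row            # a later update for the same key wins
    return merged


base_v1 = {"id1": {"price": 10}, "id2": {"price": 20}}
log = [("id2", {"price": 25}), ("id3", {"price": 30}), ("id2", {"price": 27})]

base_v2 = compact(base_v1, log)
print(base_v2)
```

In this sketch the new version contains the latest row for every key, while the previous version stays readable until the merge completes.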
Compaction can be synchronous or asynchronous.
An asynchronous compaction is performed in two steps: a compaction is first scheduled, and the scheduled compaction is then executed.
Synchronous mode
Specify the following options when writing data:
option("hoodie.compact.inline", "true").
option("hoodie.compact.inline.max.delta.commits", "2").
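As a rough sketch of where these two options fit, the following PySpark-style example combines them with a basic set of MOR write options. The helper names, the record key field `id`, and the precombine field `ts` are assumptions made for illustration. With `hoodie.compact.inline.max.delta.commits` set to `2`, a compaction is expected to be attempted after every two delta commits.

```python
# Inline (synchronous) compaction settings, taken from the options above.
INLINE_COMPACTION_OPTIONS = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "2",
}


def build_write_options(table_name, record_key, precombine_field):
    """Combine basic MOR write options with the inline compaction settings."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    options.update(INLINE_COMPACTION_OPTIONS)
    return options


def write_with_inline_compaction(df, base_path, options):
    """Append a Spark DataFrame to a Hudi table (df is a pyspark.sql.DataFrame)."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# "id" and "ts" are hypothetical column names; replace them with real ones.
opts = build_write_options("tb_test_mor", "id", "ts")
```

Calling `write_with_inline_compaction(df, "/tmp/default/tb_test_mor", opts)` requires a Spark session with the Hudi bundle on the classpath; the compaction then runs as part of the write itself.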
Asynchronous mode
Scheduling a compaction using hudi-cli:
compaction schedule --hoodieConfigs 'hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy,hoodie.compaction.target.io=1,hoodie.compact.inline.max.delta.commits=1'
Executing a compaction:
compaction run --parallelism 100 --sparkMemory 1g --retry 1 --compactionInstant 20210602101315 --hoodieConfigs 'hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy,hoodie.compaction.target.io=1,hoodie.compact.inline.max.delta.commits=1' --propsFilePath hdfs://hacluster/tmp/default/tb_test_mor/.hoodie/hoodie.properties --schemaFilePath /tmp/default/tb_test_mor/.hoodie/compact_tb_base.json
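The two hudi-cli commands above pass the same settings through `--hoodieConfigs` as a comma-separated list of `key=value` pairs. When such commands are generated by a script, the list can be assembled from a dictionary, as in this minimal sketch. The helper name `build_hoodie_configs` is hypothetical.

```python
def build_hoodie_configs(configs):
    """Join key/value pairs into the comma-separated form used by --hoodieConfigs."""
    return ",".join(f"{key}={value}" for key, value in configs.items())


# Settings copied from the commands above; dict order is preserved in the output.
compaction_configs = {
    "hoodie.compaction.strategy":
        "org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy",
    "hoodie.compaction.target.io": 1,
    "hoodie.compact.inline.max.delta.commits": 1,
}

hoodie_configs = build_hoodie_configs(compaction_configs)

# Text of the scheduling command to enter in hudi-cli.
schedule_command = f"compaction schedule --hoodieConfigs '{hoodie_configs}'"
print(schedule_command)
```

Keeping the settings in one place helps ensure that the scheduling and execution commands use identical values.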
Scheduling a compaction using spark-submit:
spark-submit --master yarn --jars /opt/client/Hudi/hudi/lib/hudi-client-common-xxx.jar --class org.apache.hudi.utilities.HoodieCompactor /opt/client/Hudi/hudi/lib/hudi-utilities_xxx.jar --base-path /tmp/default/tb_test_mor --table-name tb_test_mor --parallelism 100 --spark-memory 1G --schema-file /tmp/default/tb_test_mor/.hoodie/compact_tb_base.json --instant-time 20210602141810 --schedule --strategy org.apache.hudi.table.action.compact.strategy.UnBoundedCompactionStrategy
Executing a compaction: