Compaction

Compaction merges the base files and log files of Merge On Read (MOR) tables.

In a MOR table, data is stored in a combination of columnar Parquet files and row-based Avro files. Updates are first recorded in incremental (log) files, and a synchronous or asynchronous compaction then merges them to generate new versions of the columnar files. Because MOR tables are intended to reduce data ingestion latency, an asynchronous compaction that does not block ingestion is useful.
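The merge described above can be illustrated with a small toy model. This is only a sketch: plain Python dictionaries stand in for the Parquet and Avro files, and the function name `compact_slice` and the record layout are assumptions made for illustration, not part of Hudi's API.

```python
# Toy model of one file slice: a base file plus an ordered list of log records.
# Dicts stand in for Parquet/Avro files; none of these names come from Hudi.

def compact_slice(base_rows, log_records):
    """Return a new base-file version with the log records applied in order."""
    merged = dict(base_rows)          # copy, so the old base version stays readable
    for op, key, row in log_records:
        if op == "upsert":
            merged[key] = row         # a later log record overrides earlier values
        elif op == "delete":
            merged.pop(key, None)     # remove the key if it is present
    return merged


base_v1 = {"id1": {"price": 10}, "id2": {"price": 20}}
log = [
    ("upsert", "id2", {"price": 25}),   # update an existing row
    ("upsert", "id3", {"price": 30}),   # insert a new row
    ("delete", "id1", None),            # delete a row
]

base_v2 = compact_slice(base_v1, log)
print(base_v2)
```

In this sketch the old base version is left unchanged and a new version is produced, which mirrors the idea that compaction generates a new version of the columnar file rather than editing the existing one in place.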

An asynchronous compaction is performed in the following two steps:

  1. Scheduling a compaction: This step is performed by the ingestion job, that is, the job that imports data into the data lake. In this step, Hudi scans the partitions and selects the file slices to be compacted. A compaction plan is finally written to the Hudi timeline.
  2. Executing a compaction: A separate process or thread reads the compaction plan and performs the compaction of file slices.
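The two steps above can be sketched as a small producer/consumer program. This is a toy model under stated assumptions, not Hudi code: a `queue.Queue` stands in for the Hudi timeline, the selection rule `min_log_files` is a hypothetical strategy chosen for illustration, and "compacting" a slice is reduced to recording its ID.

```python
import queue
import threading

# Toy stand-in for the timeline: it carries compaction plans from the
# ingestion job to the compactor. Not Hudi's API.
timeline = queue.Queue()
compacted = []


def schedule_compaction(file_slices, min_log_files=2):
    """Step 1 (ingestion job): select file slices and publish a plan.

    The min_log_files threshold is a hypothetical selection rule.
    """
    plan = [s["id"] for s in file_slices if s["log_files"] >= min_log_files]
    if plan:
        timeline.put(plan)            # "write the plan to the timeline"
    return plan


def run_compactor():
    """Step 2 (separate thread): read plans and compact the listed slices."""
    while True:
        plan = timeline.get()
        if plan is None:              # sentinel value: no more plans
            break
        for slice_id in plan:
            compacted.append(slice_id)  # placeholder for the real merge work


worker = threading.Thread(target=run_compactor)
worker.start()

slices = [
    {"id": "p1/f1", "log_files": 3},
    {"id": "p1/f2", "log_files": 1},
    {"id": "p2/f3", "log_files": 2},
]
plan = schedule_compaction(slices)    # returns immediately; ingestion is not blocked

timeline.put(None)                    # tell the compactor to stop
worker.join()
print(plan, compacted)
```

The point of the split is that scheduling only writes a plan and returns, so the ingestion job can continue while the separate thread does the expensive merge.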

Compaction can be synchronous or asynchronous.

Synchronous modes

Asynchronous modes