Batch Write

Scenario

Hudi provides multiple write modes. For details, see the configuration item hoodie.datasource.write.operation. This section describes upsert, insert, and bulk_insert.

  • The insert operation does not sort records by primary key, so it is not recommended for dataset initialization.
  • Use insert for new data, upsert for data that needs to be updated, and bulk_insert to initialize a dataset.
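
The guidance above can be sketched as a small helper that returns a value for hoodie.datasource.write.operation. This is only an illustration: the object name WriteOperationChooser, the function chooseWriteOperation, and its two flags are hypothetical and are not part of Hudi.

```scala
// Hypothetical helper (not a Hudi API): maps the guidance above to a value
// for the hoodie.datasource.write.operation configuration item.
object WriteOperationChooser {
  def chooseWriteOperation(isInitialLoad: Boolean, hasUpdates: Boolean): String =
    if (isInitialLoad) "bulk_insert" // the dataset is being initialized
    else if (hasUpdates) "upsert"    // existing records need to be updated
    else "insert"                    // the batch contains only new records
}
```

The returned string could then be passed to option("hoodie.datasource.write.operation", ...) when writing a DataFrame.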

Writing Data to Hudi Tables in Batches

  1. Import the Hudi package and generate test data. For details, see steps 2 to 4 in Getting Started.
  2. To set the write mode to bulk_insert, add option("hoodie.datasource.write.operation", "bulk_insert") to the command that writes data to the Hudi table. For example:
    df.write.format("org.apache.hudi").
    options(getQuickstartWriteConfigs).
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    option("hoodie.bulkinsert.shuffle.parallelism", 4).
    mode(Overwrite).
    save(basePath)
    • For details about the parameters in the example, see Table 1.
    • If the Spark DataSource API is used to update a MOR table and only a small volume of data is written, the updated data may be merged into small base files. As a result, some updated data can already be found in the read-optimized view of the MOR table.
    • If the base file of the data to be updated is a small file, the inserted data and the updated data are merged with that base file to generate a new base file instead of being written to logs.
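
The example above initializes the table with bulk_insert. For a later batch that updates existing records, the same write could be switched to upsert, roughly as sketched below. The function name upsertBatch and its parameters are hypothetical. df, tableName, and basePath are assumed to come from the Getting Started steps, the field names mirror the example above, and the Hive synchronization options are omitted for brevity.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper (not a Hudi API): writes one batch to an existing
// non-partitioned Hudi table using the upsert operation.
def upsertBatch(df: DataFrame, tableName: String, basePath: String): Unit = {
  df.write.format("org.apache.hudi")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.table.name", tableName)
    .option("hoodie.upsert.shuffle.parallelism", "4")
    .mode(SaveMode.Append) // Append keeps existing data; Overwrite would recreate the table
    .save(basePath)
}
```

In a spark-shell session where the earlier steps have been run, this could be called as upsertBatch(df, tableName, basePath).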

Configuring Partitions

Hudi supports multiple partitioning modes, such as multi-level partitioning, non-partitioning, single-level partitioning, and partitioning by date. You can select a proper partitioning mode as required. The following describes how to configure different partitioning modes for Hudi.