Batch Write

Hudi supports multiple write operations, which are selected with the configuration item hoodie.datasource.write.operation. This section describes upsert, insert, and bulk_insert.

  • The insert operation does not sort records by primary key, so it is not recommended for initializing a dataset.
  • Use insert for new data, upsert for data that updates existing records, and bulk_insert to initialize a dataset.
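
The guidance above can be expressed as a small decision helper. This is only a sketch: the WriteOperation object and its initialLoad and hasUpdates parameters are hypothetical names introduced for illustration; only the configuration key and the three operation values come from Hudi.

```scala
// Hypothetical helper (not part of Hudi): maps the guidance above to a
// value for the hoodie.datasource.write.operation configuration item.
object WriteOperation {
  val OperationKey = "hoodie.datasource.write.operation"

  def choose(initialLoad: Boolean, hasUpdates: Boolean): String =
    if (initialLoad) "bulk_insert" // initializing a dataset
    else if (hasUpdates) "upsert"  // existing records may be updated
    else "insert"                  // the batch contains only new records

  // Returns a key/value pair that can be passed to DataFrameWriter.option.
  def option(initialLoad: Boolean, hasUpdates: Boolean): (String, String) =
    OperationKey -> choose(initialLoad, hasUpdates)
}
```

The resulting pair could then be applied with df.write.format("hudi").option(key, value) alongside the other write options.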

Example:

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.spark.sql.SaveMode.Overwrite

// df, par, db, and tableName are assumed to be defined earlier in the session.
df.write.format("hudi").
        option(PRECOMBINE_FIELD_OPT_KEY, "col4"). // Specify the precombine field, which must be sortable.
        option(RECORDKEY_FIELD_OPT_KEY, "primary_key"). // Specify the primary key of the Hudi table. The primary key must be unique.
        option(PARTITIONPATH_FIELD_OPT_KEY, "col0"). // Specify the partition field.
        option(OPERATION_OPT_KEY, "bulk_insert"). // Set the write operation to bulk_insert.
        option("hoodie.bulkinsert.shuffle.parallelism", par.toString). // Specify the parallelism of the bulk_insert operation.
        option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). // Enable synchronization of the Hudi table to Hive.
        option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0"). // Specify the Hive partition column name.
        option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor"). // Specify the class that extracts partition values for Hive.
        option(HIVE_DATABASE_OPT_KEY, db). // Specify the Hive database to synchronize to.
        option(HIVE_TABLE_OPT_KEY, tableName). // Specify the Hive table to synchronize to.
        option(HIVE_USE_JDBC_OPT_KEY, "false"). // Do not use JDBC for Hive synchronization. The default value is true.
        option(TABLE_NAME, tableName). // Specify the Hudi table name.
        mode(Overwrite). // Specify the write mode. Overwrite recreates the table if it already exists.
        save(s"/tmp/${db}/${tableName}") // Specify the storage path of the Hudi table.

  • When the Spark DataSource API is used to upsert a small volume of data into a MOR table, the updates may be merged into small base files instead of being written to log files. As a result, some updated data can appear in the read-optimized view of the MOR table.
  • If the base file that holds the data to be updated is a small file, the inserted data and the updated data are merged with that base file to generate a new base file instead of being written to log files.
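
The notes above apply to upserts, which the bulk_insert example does not show. The following is a minimal sketch of the options for a follow-up upsert into a MOR table, written with plain string keys so that it needs no Hudi imports. The UpsertOptions object and its parameters are hypothetical; the field names are carried over from the example above. The small-file threshold hoodie.parquet.small.file.limit is included on the assumption that it is what determines whether a base file counts as a small file; the value shown (104857600 bytes) is assumed to match the Hudi default, so verify it against your Hudi version.

```scala
// Hypothetical helper (not part of Hudi): builds the write options for an
// upsert into a merge-on-read table using plain configuration keys.
object UpsertOptions {
  def build(
      tableName: String,
      parallelism: Int,
      smallFileLimitBytes: Long = 104857600L // assumed default threshold (100 MB)
  ): Map[String, String] = Map(
    "hoodie.table.name" -> tableName,
    "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> "primary_key",
    "hoodie.datasource.write.precombine.field" -> "col4",
    "hoodie.datasource.write.partitionpath.field" -> "col0",
    "hoodie.upsert.shuffle.parallelism" -> parallelism.toString,
    // Base files below this size are candidates for small-file handling.
    "hoodie.parquet.small.file.limit" -> smallFileLimitBytes.toString
  )
}

// Usage in a Spark session (df, db, tableName, and par as in the example above):
//   df.write.format("hudi").
//     options(UpsertOptions.build(tableName, par)).
//     mode(org.apache.spark.sql.SaveMode.Append).
//     save(s"/tmp/${db}/${tableName}")
```

Append is used here rather than Overwrite so that the existing table is updated instead of being recreated.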

Configuring Partitions

Hudi supports multiple partitioning modes, such as multi-level partitioning, non-partitioning, single-level partitioning, and partitioning by date. Select the mode that fits your requirements. The following describes how to configure each partitioning mode for Hudi.