Hudi provides multiple write modes. For details, see the configuration item hoodie.datasource.write.operation. This section describes upsert, insert, and bulk_insert.
Example:
    // Imports assumed for Hudi releases that expose the *_OPT_KEY constants.
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
    import org.apache.spark.sql.SaveMode.Overwrite

    // df, par, db, and tableName are defined by the caller.
    df.write.format("hudi").
      option(PRECOMBINE_FIELD_OPT_KEY, "col4"). // Specify the precombine field, which must be sortable.
      option(RECORDKEY_FIELD_OPT_KEY, "primary_key"). // Specify the primary key of the Hudi table. The primary key must be unique.
      option(PARTITIONPATH_FIELD_OPT_KEY, "col0"). // Specify the partition field.
      option(OPERATION_OPT_KEY, "bulk_insert"). // Specify that the operation is bulk_insert.
      option("hoodie.bulkinsert.shuffle.parallelism", par.toString). // Specify the parallelism of the bulk_insert operation.
      option(HIVE_SYNC_ENABLED_OPT_KEY, "true"). // Synchronize the Hudi table to Hive.
      option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0"). // Specify the Hive partition column name.
      option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor"). // Specify the class that extracts partition values for Hive.
      option(HIVE_DATABASE_OPT_KEY, db). // Specify the Hive database.
      option(HIVE_TABLE_OPT_KEY, tableName). // Specify the Hive table.
      option(HIVE_USE_JDBC_OPT_KEY, "false"). // Specify whether to use JDBC for Hive synchronization. The default value is true.
      option(TABLE_NAME, tableName). // Specify the Hudi table name.
      mode(Overwrite). // Specify the write mode.
      save(s"/tmp/${db}/${tableName}") // Specify the storage path of the Hudi table.
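The three operations covered in this section differ only in the value passed for hoodie.datasource.write.operation. As a minimal sketch, that option could be built as a plain map and handed to the writer. The object and method names (WriteOperationOptions, forOperation) are hypothetical; only the option key and the three values come from this section.

```scala
// Hypothetical helper: builds the writer option that selects the Hudi write operation.
object WriteOperationOptions {
  // The operations described in this section.
  val Covered: Set[String] = Set("upsert", "insert", "bulk_insert")

  def forOperation(operation: String): Map[String, String] = {
    require(Covered.contains(operation), s"Operation not covered in this section: $operation")
    Map("hoodie.datasource.write.operation" -> operation)
  }
}
```

The resulting map can be passed with the DataFrame writer's options method, for example df.write.format("hudi").options(WriteOperationOptions.forOperation("upsert")), alongside the other options shown in the example.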
Hudi supports multiple partitioning modes, such as multi-level partitioning, non-partitioning, single-level partitioning, and partitioning by date. You can select a proper partitioning mode as required. The following describes how to configure different partitioning modes for Hudi.
Multi-level partitioning indicates that multiple fields are specified as partition keys. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Configure multiple partition fields, for example, p1, p2, and p3. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to p1, p2, and p3. The values must be the same as the partition fields of hoodie.datasource.write.partitionpath.field. |
hoodie.datasource.write.keygenerator.class | Set this parameter to org.apache.hudi.keygen.ComplexKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.MultiPartKeysValueExtractor. |
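The four settings for multi-level partitioning can be collected into one option map, which keeps the write-side and Hive-side partition fields identical by construction. The following is a sketch only: the object name MultiLevelPartitionOptions is hypothetical, and joining the field names with commas is an assumption about how the list is written.

```scala
// Sketch: writer options for a table partitioned by several fields.
object MultiLevelPartitionOptions {
  def apply(fields: Seq[String]): Map[String, String] = {
    require(fields.size > 1, "multi-level partitioning needs at least two fields")
    // Assumption: the field list is written as a comma-separated string.
    val joined = fields.mkString(",")
    Map(
      // Both entries use the same string, so the two settings cannot drift apart.
      "hoodie.datasource.write.partitionpath.field" -> joined,
      "hoodie.datasource.hive_sync.partition_fields" -> joined,
      "hoodie.datasource.write.keygenerator.class" ->
        "org.apache.hudi.keygen.ComplexKeyGenerator",
      "hoodie.datasource.hive_sync.partition_extractor_class" ->
        "org.apache.hudi.hive.MultiPartKeysValueExtractor"
    )
  }
}
```

For example, MultiLevelPartitionOptions(Seq("p1", "p2", "p3")) yields a map that can be passed to the writer with options(...).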
Hudi supports non-partitioned tables. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Leave this parameter blank. |
hoodie.datasource.hive_sync.partition_fields | Leave this parameter blank. |
hoodie.datasource.write.keygenerator.class | Set this parameter to org.apache.hudi.keygen.NonpartitionedKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.NonPartitionedExtractor. |
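For a non-partitioned table, the same four keys apply, with the two partition field settings left empty. A minimal sketch follows; the object name NonPartitionedOptions is hypothetical, and representing "blank" as an empty string is an assumption.

```scala
// Sketch: writer options for a non-partitioned Hudi table.
object NonPartitionedOptions {
  val options: Map[String, String] = Map(
    // Assumption: "blank" is expressed as an empty string.
    "hoodie.datasource.write.partitionpath.field" -> "",
    "hoodie.datasource.hive_sync.partition_fields" -> "",
    "hoodie.datasource.write.keygenerator.class" ->
      "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.hive_sync.partition_extractor_class" ->
      "org.apache.hudi.hive.NonPartitionedExtractor"
  )
}
```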
Single-level partitioning is similar to multi-level partitioning, except that only one field is specified as the partition key. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Set this parameter to one field, for example, p. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to p. The value must be the same as the partition field of hoodie.datasource.write.partitionpath.field. |
hoodie.datasource.write.keygenerator.class | (Optional) The default value is org.apache.hudi.keygen.SimpleKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.MultiPartKeysValueExtractor. |
In partitioning by date, a date field is specified as the partition field. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Set this parameter to the date field, for example, operationTime. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to operationTime. The value must be the same as the preceding partition field. |
hoodie.datasource.write.keygenerator.class | (Optional) The default value is org.apache.hudi.keygen.SimpleKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor. |
The date format for SlashEncodedDayPartitionValueExtractor must be yyyy/MM/dd (year/month/day), for example, 2021/03/05.
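Because the extractor expects slash-separated year, month, and day, the partition column must already hold values in that layout before the write. The following minimal sketch produces such a value with java.time. The object name DatePartitionPath and the sample date are illustrative assumptions. In Java pattern syntax the month is written with uppercase MM, because lowercase mm denotes minutes.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Sketch: format a date as a slash-encoded day path (year/month/day).
object DatePartitionPath {
  private val SlashDay: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  def of(date: LocalDate): String = date.format(SlashDay)
}
```

For example, DatePartitionPath.of(LocalDate.of(2021, 3, 5)) returns "2021/03/05". In a Spark job, a comparable column could likely be derived with date_format(col("operationTime"), "yyyy/MM/dd") from org.apache.spark.sql.functions, assuming operationTime is a date or timestamp column.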
Configuration Item | Description |
---|---|
hoodie.bulkinsert.user.defined.partitioner.class | Specifies the partition sorting class. You can customize a sorting method. For details, see the sample code. |
By default, bulk_insert sorts data by character (lexicographically), so the default sorting applies only to primary keys of StringType.
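To illustrate what sorting by character means for keys that are not naturally strings, the following sketch compares character order with numeric order on the same values. It is plain Scala and does not reproduce Hudi's internal sorting; the object name CharacterSortDemo and the sample keys are assumptions made for the illustration.

```scala
// Sketch: character-by-character ordering versus numeric ordering of the same keys.
object CharacterSortDemo {
  val numericKeys: Seq[Int] = Seq(1, 2, 9, 10, 100)

  // Sorted as strings: compared character by character, so "10" precedes "2".
  val sortedAsStrings: Seq[String] = numericKeys.map(_.toString).sorted

  // Sorted as integers: natural numeric order.
  val sortedAsInts: Seq[Int] = numericKeys.sorted
}
```

Here sortedAsStrings is 1, 10, 100, 2, 9, while sortedAsInts is 1, 2, 9, 10, 100. When a different ordering is required for non-string keys, a custom sorting class can be supplied through hoodie.bulkinsert.user.defined.partitioner.class.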