Hudi provides multiple write modes. For details, see the configuration item hoodie.datasource.write.operation. This section describes upsert, insert, and bulk_insert. The following example writes a non-partitioned table in bulk_insert mode:

    import org.apache.hudi.QuickstartUtils._
    import org.apache.spark.sql.SaveMode.Overwrite

    // df, tableName, and basePath are assumed to be defined earlier.
    df.write.format("org.apache.hudi").
      options(getQuickstartWriteConfigs).
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      // Non-partitioned table: the partition settings are left blank.
      option("hoodie.datasource.write.partitionpath.field", "").
      option("hoodie.datasource.write.operation", "bulk_insert").
      option("hoodie.table.name", tableName).
      option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
      // Synchronize the table to Hive.
      option("hoodie.datasource.hive_sync.enable", "true").
      option("hoodie.datasource.hive_sync.partition_fields", "").
      option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
      option("hoodie.datasource.hive_sync.table", tableName).
      option("hoodie.datasource.hive_sync.use_jdbc", "false").
      option("hoodie.bulkinsert.shuffle.parallelism", 4).
      mode(Overwrite).
      save(basePath)

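Switching between the write modes covered here only requires changing the value of hoodie.datasource.write.operation. The following is a minimal sketch in plain Scala. The names describedOperations and writeOperationOption are hypothetical helpers introduced for illustration, and the set lists only the three operations this section covers.

```scala
// Hypothetical helper: only the three operations described in this section.
val describedOperations: Set[String] = Set("upsert", "insert", "bulk_insert")

// Builds the key/value pair that selects the write operation.
def writeOperationOption(operation: String): (String, String) = {
  require(describedOperations.contains(operation), s"Operation not covered here: $operation")
  "hoodie.datasource.write.operation" -> operation
}

val (optionKey, optionValue) = writeOperationOption("upsert")
```

The resulting pair can be passed to the writer as option(optionKey, optionValue) in place of the hard-coded bulk_insert setting.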
Hudi supports multiple partitioning modes, such as multi-level partitioning, non-partitioning, single-level partitioning, and partitioning by date. You can select a proper partitioning mode as required. The following describes how to configure different partitioning modes for Hudi.
Multi-level partitioning indicates that multiple fields are specified as partition keys. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Set this parameter to multiple partition fields separated by commas, for example, p1,p2,p3. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to p1,p2,p3. The value must be the same as the partition fields of hoodie.datasource.write.partitionpath.field. |
hoodie.datasource.write.keygenerator.class | Set this parameter to org.apache.hudi.keygen.ComplexKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.MultiPartKeysValueExtractor. |
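As a sketch of how these items fit together, the four settings can be collected in one map and handed to the writer. The field names p1, p2, and p3 and the name multiLevelOptions are placeholders, and the example assumes that the fields are passed as one comma-separated string.

```scala
// Placeholder partition columns; replace them with the fields of your table.
val partitionFields = "p1,p2,p3"

val multiLevelOptions: Map[String, String] = Map(
  // Both settings must name the same fields in the same order.
  "hoodie.datasource.write.partitionpath.field" -> partitionFields,
  "hoodie.datasource.hive_sync.partition_fields" -> partitionFields,
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
)
```

The map can then be applied with df.write.format("org.apache.hudi").options(multiLevelOptions) alongside the other write options. Deriving both partition settings from a single value keeps them from drifting apart.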
Hudi supports non-partitioned tables. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Leave this parameter blank. |
hoodie.datasource.hive_sync.partition_fields | Leave this parameter blank. |
hoodie.datasource.write.keygenerator.class | Set this parameter to org.apache.hudi.keygen.NonpartitionedKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.NonPartitionedExtractor. |
Single-level partitioning is similar to multi-level partitioning, except that only one field is specified as the partition key. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Set this parameter to one field, for example, p. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to p. The value must be the same as the partition field of hoodie.datasource.write.partitionpath.field. |
hoodie.datasource.write.keygenerator.class | (Optional) The default value is org.apache.hudi.keygen.SimpleKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.MultiPartKeysValueExtractor. |
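A minimal sketch of the single-level case follows. The field name p and the name singleLevelOptions are placeholders. The key generator is written out explicitly only for readability, because the table above lists SimpleKeyGenerator as the default.

```scala
// Placeholder partition column.
val partitionField = "p"

val singleLevelOptions: Map[String, String] = Map(
  "hoodie.datasource.write.partitionpath.field" -> partitionField,
  "hoodie.datasource.hive_sync.partition_fields" -> partitionField,
  // Optional: this is the default key generator, so the entry may be omitted.
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
)
```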
Partitioning by date specifies a date field as the partition field. Pay attention to the following configuration items:
Configuration Item | Description |
---|---|
hoodie.datasource.write.partitionpath.field | Set this parameter to the date field, for example, operationTime. |
hoodie.datasource.hive_sync.partition_fields | Set this parameter to operationTime. The value must be the same as the preceding partition field. |
hoodie.datasource.write.keygenerator.class | (Optional) The default value is org.apache.hudi.keygen.SimpleKeyGenerator. |
hoodie.datasource.hive_sync.partition_extractor_class | Set this parameter to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor. |
The date format for SlashEncodedDayPartitionValueExtractor must be yyyy/MM/dd (four-digit year, two-digit month, two-digit day), for example, 2021/03/05.
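The sketch below shows one way to produce a partition value in the slash-separated year/month/day layout, together with the matching options. It uses java.time from the JDK. The date and the names dayFormat, partitionValue, and dateOptions are illustrative assumptions, and the pattern uses MM because that letter denotes the month in Java pattern syntax.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Year/month/day separated by slashes; "MM" is month-of-year in Java patterns.
val dayFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

// Illustrative date; in practice the value comes from the date column.
val partitionValue: String = LocalDate.of(2021, 3, 5).format(dayFormat) // "2021/03/05"

val dateOptions: Map[String, String] = Map(
  "hoodie.datasource.write.partitionpath.field" -> "operationTime",
  "hoodie.datasource.hive_sync.partition_fields" -> "operationTime",
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor"
)
```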
Configuration Item | Description |
---|---|
hoodie.bulkinsert.user.defined.partitioner.class | Specifies the partition sorting class. You can customize a sorting method. For details, see the sample code. |
By default, bulk_insert sorts records by the character order of their primary keys, so the default sorting is suitable only for primary keys of StringType.
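To illustrate why character ordering suits only string keys, the following sketch compares character order with numeric order. It uses plain Scala collections rather than the Hudi partitioner API, and the key values are hypothetical.

```scala
// Hypothetical primary keys that hold numbers but are stored as strings.
val keys = Seq("10", "9", "2")

// Character order compares the keys one character at a time.
val byCharacter = keys.sorted        // List("10", "2", "9")

// Numeric order parses each key before comparing.
val byNumber = keys.sortBy(_.toLong) // List("2", "9", "10")
```

When keys are numeric, a custom class configured through hoodie.bulkinsert.user.defined.partitioner.class can apply a comparison like the second one instead of the default.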