<a name="mrs_01_24036"></a>
<h1 class="topictitle1">Stream Write</h1>
<div id="body32001227"><p id="mrs_01_24036__en-us_topic_0000001173631458_p8060118">The HoodieDeltaStreamer tool provided by Hudi supports stream write. You can also use Spark Streaming to write data in micro-batch mode (a sketch of this approach follows the list below). HoodieDeltaStreamer provides the following functions:</p>
<ul id="mrs_01_24036__en-us_topic_0000001173631458_ul11839154814112"><li id="mrs_01_24036__en-us_topic_0000001173631458_li11839194881119">Supports multiple data sources, such as Kafka and DFS.</li><li id="mrs_01_24036__en-us_topic_0000001173631458_li9839154891115">Manages checkpoints, rollback, and recovery to ensure exactly-once semantics.</li><li id="mrs_01_24036__en-us_topic_0000001173631458_li28395481110">Supports user-defined transformations.</li></ul>
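<p>The following is a minimal, illustrative sketch of the Spark Streaming micro-batch approach mentioned above, written against the Spark Structured Streaming <strong>foreachBatch</strong> API. The topic name, broker address, paths, table name, and field schema are placeholder assumptions, not part of the original example; adjust them to your environment.</p>
<pre class="screen">import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class HudiMicroBatchWriteExample {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("hudi-micro-batch").getOrCreate();

    // Schema of the JSON records consumed from Kafka (assumed for this sketch).
    StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.StringType);

    // Read a stream of JSON records from Kafka (placeholder broker address and topic).
    Dataset&lt;Row> source = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "xx.xx.xx.xx:xx")
        .option("subscribe", "demo_topic")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("data"))
        .select("data.*");

    // Upsert each micro-batch into a Hudi table. The cast to VoidFunction2
    // disambiguates the Java overload of foreachBatch.
    source.writeStream()
        .foreachBatch((VoidFunction2&lt;Dataset&lt;Row>, Long>) (batch, batchId) -> batch.write()
            .format("hudi")
            .option("hoodie.table.name", "hudi_micro_batch_table")
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.partitionpath.field", "age")
            .option("hoodie.datasource.write.precombine.field", "name")
            .mode(SaveMode.Append)
            .save("hdfs://hacluster/tmp/huditest/hudi_micro_batch_table"))
        .option("checkpointLocation", "hdfs://hacluster/tmp/huditest/checkpoint")
        .start()
        .awaitTermination();
  }
}</pre>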
<p id="mrs_01_24036__en-us_topic_0000001173631458_p20792185742216">Example:</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p63958215255">Prepare the configuration file <strong id="mrs_01_24036__en-us_topic_0000001173631458_b17521373144">kafka-source.properties</strong>.</p>
<pre class="screen" id="mrs_01_24036__en-us_topic_0000001173631458_screen610745911126"># Hudi configuration
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=age
hoodie.upsert.shuffle.parallelism=100
# Hive configuration
hoodie.datasource.hive_sync.table=hudimor_deltastreamer_partition
hoodie.datasource.hive_sync.partition_fields=age
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.support_timestamp=true
# Kafka source topic
hoodie.deltastreamer.source.kafka.topic=hudimor_deltastreamer_partition
# Checkpoint
hoodie.deltastreamer.checkpoint.provider.path=hdfs://hacluster/tmp/huditest/hudimor_deltastreamer_partition
# Kafka properties: the Kafka cluster to ingest from
bootstrap.servers=xx.xx.xx.xx:xx
auto.offset.reset=earliest
#auto.offset.reset=latest
group.id=hoodie-delta-streamer
offset.rang.limit=10000</pre>
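<p>For illustration only (the payload below is an assumption, not part of the original example), the JSON records consumed from the topic would contain at least the fields referenced by this configuration: <strong>id</strong> (record key), <strong>age</strong> (partition path), and <strong>name</strong> (the ordering field used in the spark-submit command below).</p>
<pre class="screen">{"id": "1", "name": "Alice", "age": "20"}
{"id": "2", "name": "Bob", "age": "21"}</pre>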
<p id="mrs_01_24036__en-us_topic_0000001173631458_p125405434265">Run the following command to specify the HoodieDeltaStreamer execution parameters (for details about the parameter configuration, visit the official website at <a href="https://hudi.apache.org/" target="_blank" rel="noopener noreferrer">https://hudi.apache.org/</a>). The parameters form a single <strong>spark-submit</strong> invocation; the inline <strong>//</strong> comments explain each parameter and are not part of the command:</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p13711104718296"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b1167715101617">spark-submit --master yarn</strong></p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p10196105418327"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b31721315191618">--jars /opt/hudi-java-examples-1.0.jar</strong> // Specify the Hudi JAR packages required for running Spark.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p10258112312255"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b2971122112161">--driver-memory 1g</strong></p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p16258162318259"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b109791221111616">--executor-memory 1g --executor-cores 1 --num-executors 2 --conf spark.kryoserializer.buffer.max=128m</strong></p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p1325822312258"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b398319214165">--driver-class-path /opt/client/Hudi/hudi/conf:/opt/client/Hudi/hudi/lib/*:/opt/client/Spark2x/spark/jars/*:/opt/hudi-examples-0.6.1-SNAPSHOT.jar:/opt/hudi-examples-0.6.1-SNAPSHOT-tests.jar</strong> // Specify the Hudi configuration directory and JAR packages required by the Spark driver.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p1358413138294"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b10914112861618">--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer spark-internal</strong></p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p14258723192519"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b1919102881610">--props file:///opt/kafka-source.properties</strong> // Specify the configuration file. When submitting tasks in yarn-cluster mode, set the configuration file path to an HDFS path.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p9735181842919"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b152717353173">--target-base-path /tmp/huditest/hudimor1_deltastreamer_partition</strong> // Specify the path of the Hudi table.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p1746352217297"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b10445143911170">--table-type MERGE_ON_READ</strong> // Specify the type of the Hudi table to be written.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p1625862382510"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b1012054361714">--target-table hudimor_deltastreamer_partition</strong> // Specify the Hudi table name.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p6129105816357"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b3909146161714">--source-ordering-field name</strong> // Specify the field used to pre-combine records in the Hudi table.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p15698145711286"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b4423145261711">--source-class org.apache.hudi.utilities.sources.JsonKafkaSource</strong> // Set the data source to consume to <strong id="mrs_01_24036__en-us_topic_0000001173631458_b4597131913517">JsonKafkaSource</strong>. Specify a different source class for each type of data source.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p325832317251"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b1414083510182">--schemaprovider-class com.xxxx.bigdata.hudi.examples.DataSchemaProviderExample</strong> // Specify the schema provider that supplies the schema required by the Hudi table.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p126844719286"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b4139139161814">--transformer-class com.xxx.bigdata.hudi.examples.TransformerExample</strong> // Specify how to process the data obtained from the data source. Set this parameter based on service requirements.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p964514504288"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b129210311288">--enable-hive-sync</strong> // Enable Hive synchronization to synchronize the Hudi table to Hive.</p>
<p id="mrs_01_24036__en-us_topic_0000001173631458_p172581323182511"><strong id="mrs_01_24036__en-us_topic_0000001173631458_b7255503187">--continuous</strong> // Set the stream processing mode to <strong id="mrs_01_24036__en-us_topic_0000001173631458_b187325464213">continuous</strong>.</p>
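<p>The transformer and schema provider classes referenced above are user-supplied (the <strong>com.xxx</strong>/<strong>com.xxxx</strong> package names are placeholders). The following is a minimal sketch of what such classes might look like, assuming the <strong>org.apache.hudi.utilities.transform.Transformer</strong> interface and the <strong>org.apache.hudi.utilities.schema.SchemaProvider</strong> base class from the Hudi utilities bundle (package names can vary across Hudi versions); the filter condition and Avro schema are illustrative only.</p>
<pre class="screen">// TransformerExample.java: a hypothetical transformer that drops records without a record key.
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformerExample implements Transformer {
  @Override
  public Dataset&lt;Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset&lt;Row> rowDataset, TypedProperties properties) {
    // Keep only records whose record key field is present.
    return rowDataset.filter("id IS NOT NULL");
  }
}

// DataSchemaProviderExample.java (separate file): a hypothetical schema provider
// that returns a fixed Avro schema matching the id/name/age records above.
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.api.java.JavaSparkContext;

public class DataSchemaProviderExample extends SchemaProvider {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"string\"}]}");

  public DataSchemaProviderExample(TypedProperties props, JavaSparkContext jssc) {
    super(props, jssc);
  }

  @Override
  public Schema getSourceSchema() {
    return SCHEMA;
  }

  @Override
  public Schema getTargetSchema() {
    return SCHEMA;
  }
}</pre>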
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24034.html">Write</a></div>
</div>
</div>