<a name="mrs_01_24033"></a><a name="mrs_01_24033"></a>
<h1 class="topictitle1">Quick Start</h1>
<div id="body32001227"><div class="section" id="mrs_01_24033__en-us_topic_0000001219029253_section665164316242"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_24033__en-us_topic_0000001219029253_p5148178123214">This section describes capabilities of Hudi using spark-shell. Using the Spark data source, this section describes how to insert and update a Hudi dataset of the default storage mode Copy-on Write (COW) tables based on code snippets. After each write operation, you will be introduced how to read snapshot and incremental data.</p>
</div>
<div class="section" id="mrs_01_24033__en-us_topic_0000001219029253_section192812104215"><h4 class="sectiontitle">Prerequisites</h4><ul id="mrs_01_24033__en-us_topic_0000001219029253_ul7364164620447"><li id="mrs_01_24033__en-us_topic_0000001219029253_li1236412463446">You have downloaded and installed the Hudi client. Currently, Hudi is integrated in Spark2x. You only need to download the Spark2x client on Manager. For example, the client installation directory is <strong id="mrs_01_24033__en-us_topic_0000001219029253_b362811246479">/opt/client</strong>.</li><li id="mrs_01_24033__en-us_topic_0000001219029253_li2815112015460">You have created a user and added the user to user groups <strong id="mrs_01_24033__en-us_topic_0000001219029253_b185096412813">hadoop</strong> and <strong id="mrs_01_24033__en-us_topic_0000001219029253_b10500079814">hive</strong> on Manager.</li></ul>
</div>
<div class="section" id="mrs_01_24033__en-us_topic_0000001219029253_section13661165916315"><h4 class="sectiontitle">Procedure</h4><ol id="mrs_01_24033__en-us_topic_0000001219029253_ol18943144812358"><li id="mrs_01_24033__en-us_topic_0000001219029253_li20426610193114"><span>Download and install the Hudi client. For details, see <a href="mrs_01_0787.html">Using an MRS Client</a>.</span><p><div class="note" id="mrs_01_24033__en-us_topic_0000001219029253_note1528652712326"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24033__en-us_topic_0000001219029253_p22862273327">Currently, Hudi is integrated in Spark2x. You only need to download the Spark2x client on Manager. For example, the client installation directory is <strong id="mrs_01_24033__en-us_topic_0000001219029253_b4350105913495">/opt/client</strong>.</p>
</div></div>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li6424125918379"><span>Log in to the node where the client is installed as user <strong id="mrs_01_24033__en-us_topic_0000001219029253_b1337718421528">root</strong> and run the following command:</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p6725162313386"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b174082193119">cd /opt/client</strong></p>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li7810145113516"><span>Run the following commands to load environment variables:</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p168151820193616"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1338342213512">source bigdata_env</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p42231120113619"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b12223132010362">source Hudi/component_env</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1022311204363"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b15938133716363">kinit </strong><em id="mrs_01_24033__en-us_topic_0000001219029253_i848712529462">Created user</em></p>
<div class="note" id="mrs_01_24033__en-us_topic_0000001219029253_note4405123718714"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_24033__en-us_topic_0000001219029253_ul187091039133214"><li id="mrs_01_24033__en-us_topic_0000001219029253_li1771073963219">You need to change the password of the created user, and then run the <strong id="mrs_01_24033__en-us_topic_0000001219029253_b9803240328470">kinit</strong> command to log in to the system again.</li><li id="mrs_01_24033__en-us_topic_0000001219029253_li187101839113218">In normal mode, you do not need to run the <strong id="mrs_01_24033__en-us_topic_0000001219029253_b1465716423489">kinit</strong> command.</li></ul>
</div></div>
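<p>For example, if the user created in the prerequisites is named <strong>developuser</strong> (a placeholder name used here only for illustration), the authentication command would be:</p>
<pre class="screen"># "developuser" is a placeholder; replace it with the user created in the prerequisites.
kinit developuser</pre>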
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li654313073616"><span>Use <strong id="mrs_01_24033__en-us_topic_0000001219029253_b1552914518302">spark-shell --master yarn-client</strong> to import Hudi packages to generate test data:</span><p><pre class="screen" id="mrs_01_24033__en-us_topic_0000001219029253_screen1572515110358">// Import required packages.
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
// Define the storage path and generate the test data.
val tableName = "hudi_cow_table"
val basePath = "hdfs://hacluster/tmp/hudi_cow_table"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))</pre>
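<p>The generator produces trip-style records with fields such as <strong>uuid</strong>, <strong>ts</strong>, <strong>fare</strong>, <strong>begin_lon</strong>, <strong>begin_lat</strong>, and <strong>partitionpath</strong>. If you want to inspect the test data before writing it, an optional check like the following can be run in the same spark-shell session:</p>
<pre class="screen">// Optional: inspect the schema and a couple of generated records before writing.
df.printSchema()
df.show(2, false)</pre>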
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li1916195433012"><span>Write data to the Hudi table in overwrite mode.</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p158621412124813"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b386201211488">df.write.format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p9862912104818"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b8862131224820">options(getQuickstartWriteConfigs).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p686281213487"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b886218126483">option(PRECOMBINE_FIELD_OPT_KEY, "ts").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1486241219482"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b18862141219488">option(RECORDKEY_FIELD_OPT_KEY, "uuid").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p15862151224811"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b7862161244819">option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1486271234810"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1786215126483">option(TABLE_NAME, tableName).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p128621121486"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b286218120487">mode(Overwrite).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1286161254814"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b8861111217485">save(basePath)</strong></p>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li208375443617"><span>Query the Hudi table.</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p479241816364">Register a temporary table and query the table.</p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1481750164814"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b17816503488">val roViewDF = spark.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p2816606483"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b08161606481">read.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p481614024811"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b16816170124817">format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p78161801488"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b68163094811">load(basePath + "/*/*/*/*")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p19816130134815"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1081650104812">roViewDF.createOrReplaceTempView("hudi_ro_table")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p68164004816"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b58168064815">spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where fare &gt; 20.0").show()</strong></p>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li53341630153711"><span>Generate new data and update the Hudi table in append mode.</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p17283058192515"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1696117555818">val updates = convertToStringList(dataGen.generateUpdates(10))</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p7832647164715"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1566375464712">val df = spark.read.json(spark.sparkContext.parallelize(updates, 1))</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1883224714717"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b36644542472">df.write.format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p683224716478"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b18666154144712">options(getQuickstartWriteConfigs).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1183244719478"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b126677541471">option(PRECOMBINE_FIELD_OPT_KEY, "ts").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p17832144704712"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b18668145416474">option(RECORDKEY_FIELD_OPT_KEY, "uuid").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p18321047164716"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1966925414717">option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p283214471472"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b14670115414719">option(TABLE_NAME, tableName).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p3832847154713"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b767195418475">mode(Append).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p4832947144719"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b7672145424720">save(basePath)</strong></p>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li1220334433810"><span>Query incremental data in the Hudi table.</span><p><ul id="mrs_01_24033__en-us_topic_0000001219029253_ul249493483710"><li id="mrs_01_24033__en-us_topic_0000001219029253_li104945347374">Reload data.<p id="mrs_01_24033__en-us_topic_0000001219029253_p169437513378"><a name="mrs_01_24033__en-us_topic_0000001219029253_li104945347374"></a><a name="en-us_topic_0000001219029253_li104945347374"></a><strong id="mrs_01_24033__en-us_topic_0000001219029253_b16763171883715">spark.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p194319516372"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b676531863717">read.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p6943185143718"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b876601816376">format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p9943257372"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1176781811372">load(basePath + "/*/*/*/*").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p19439543710"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1476818184378">createOrReplaceTempView("hudi_ro_table")</strong></p>
</li><li id="mrs_01_24033__en-us_topic_0000001219029253_li7475939123720">Perform the incremental query.<p id="mrs_01_24033__en-us_topic_0000001219029253_p20943105153715"><a name="mrs_01_24033__en-us_topic_0000001219029253_li7475939123720"></a><a name="en-us_topic_0000001219029253_li7475939123720"></a><strong id="mrs_01_24033__en-us_topic_0000001219029253_b17611143017373">val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(k =&gt; k.getString(0)).take(50)</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p394312583711"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1614330143719">val beginTime = commits(commits.length - 2)</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p59438512376"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b5614143013372">val incViewDF = spark.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p494317517373"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b16615143033716">read.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p2943459376"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b26151030163715">format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p169431651378"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b66161730153716">option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p0942115143714"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b166165307375">option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p159425513377"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b76161730103711">load(basePath);</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p694225173719"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b461753012371">incViewDF.registerTempTable("hudi_incr_table")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p49425516371"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b861723011374">spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare &gt; 20.0").show()</strong></p>
</li></ul>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li18176123364020"><span>Perform the point-in-time query.</span><p><p id="mrs_01_24033__en-us_topic_0000001219029253_p53794539371"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b7619163103819">val beginTime = "000"</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p2379353173715"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b262011313814">val endTime = commits(commits.length - 2)</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p93796531375"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1262117313389">val incViewDF = spark.read.format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p7379953103711"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b562220318382">option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p143791534371"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b462383173812">option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p14379153123712"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b146243323820">option(END_INSTANTTIME_OPT_KEY, endTime).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p737919537377"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b762416318386">load(basePath);</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p8378353123716"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b76251135386">incViewDF.registerTempTable("hudi_incr_table")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1137885393710"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b66262383816">spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare &gt; 20.0").show()</strong></p>
</p></li><li id="mrs_01_24033__en-us_topic_0000001219029253_li351416371594"><span>Delete data.</span><p><ul id="mrs_01_24033__en-us_topic_0000001219029253_ul107433714381"><li id="mrs_01_24033__en-us_topic_0000001219029253_li37493743814">Prepare the data to be deleted.<p id="mrs_01_24033__en-us_topic_0000001219029253_p185411833183818"><a name="mrs_01_24033__en-us_topic_0000001219029253_li37493743814"></a><a name="en-us_topic_0000001219029253_li37493743814"></a><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1797144317386">val df = spark.sql("select uuid, partitionpath from hudi_ro_table limit 2")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p85418338387"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b119730438389">val deletes = dataGen.generateDeletes(df.collectAsList())</strong></p>
</li><li id="mrs_01_24033__en-us_topic_0000001219029253_li4908144520387">Execute the deletion.<p id="mrs_01_24033__en-us_topic_0000001219029253_p115410338382"><a name="mrs_01_24033__en-us_topic_0000001219029253_li4908144520387"></a><a name="en-us_topic_0000001219029253_li4908144520387"></a><strong id="mrs_01_24033__en-us_topic_0000001219029253_b18441135713817">va</strong><strong id="mrs_01_24033__en-us_topic_0000001219029253_b13463125483810">l df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1054123383810"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b746514549380">df.write.format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p16541633203817"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b114661154123819">options(getQuickstartWriteConfigs).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p8541133319387"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b184671754143815">option(OPERATION_OPT_KEY,"delete").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1854123353818"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b846835415386">option(PRECOMBINE_FIELD_OPT_KEY, "ts").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p154116335381"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1246916544383">option(RECORDKEY_FIELD_OPT_KEY, "uuid").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1154120338384"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b147095414388">option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1954153315384"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b174701254153811">option(TABLE_NAME, tableName).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p3541143315387"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b16471254113814">mode(Append).</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p1754163323819"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b134729549386">save(basePath);</strong></p>
</li><li id="mrs_01_24033__en-us_topic_0000001219029253_li1849311012399">Query data again.<p id="mrs_01_24033__en-us_topic_0000001219029253_p554117334387"><a name="mrs_01_24033__en-us_topic_0000001219029253_li1849311012399"></a><a name="en-us_topic_0000001219029253_li1849311012399"></a><strong id="mrs_01_24033__en-us_topic_0000001219029253_b9447689393">val roViewDFAfterDelete = spark.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p135411133143814"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1644968123915">read.</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p95411533113812"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b6452186391">format("org.apache.hudi").</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p11541433123815"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b34547833920">load(basePath + "/*/*/*/*")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p15417333381"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b74552853915">roViewDFAfterDelete.createOrReplaceTempView("hudi_ro_table")</strong></p>
<p id="mrs_01_24033__en-us_topic_0000001219029253_p19541113373813"><strong id="mrs_01_24033__en-us_topic_0000001219029253_b1045618133920">spark.sql("select uuid, partitionPath from hudi_ro_table").show()</strong></p>
</li></ul>
</p></li></ol>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24025.html">Using Hudi</a></div>
</div>
</div>