<a name="mrs_01_24033"></a><a name="mrs_01_24033"></a>
|
|
|
|
<h1 class="topictitle1">Getting Started</h1>
<div id="body0000001128534151"><div class="section" id="mrs_01_24033__section665164316242"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_24033__p5148178123214">This section describes capabilities of Hudi using spark-shell. Using the Spark data source, this section describes how to insert and update a Hudi dataset of the default storage mode Copy-on Write (COW) tables based on code snippets. After each write operation, you will be introduced how to read snapshot and incremental data.</p>
|
|
</div>
<div class="section" id="mrs_01_24033__section192812104215"><h4 class="sectiontitle">Prerequisites</h4><ul id="mrs_01_24033__ul7364164620447"><li id="mrs_01_24033__li2815112015460">You have created a user and added the user to user groups <strong id="mrs_01_24033__b185096412813">hadoop</strong> (primary group) and <strong id="mrs_01_24033__b10500079814">hive</strong> on Manager.</li></ul>
|
|
</div>
<div class="section" id="mrs_01_24033__section13661165916315"><h4 class="sectiontitle">Procedure</h4><ol id="mrs_01_24033__ol18943144812358"><li id="mrs_01_24033__li20426610193114"><span>Download and install the Hudi client. For details, see <a href="mrs_01_2127.html">Installing a Client (Version 3.x or Later)</a>.</span><p><div class="note" id="mrs_01_24033__note1528652712326"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24033__p22862273327">Currently, Hudi is integrated in Spark2x. You only need to download the Spark2x client on Manager. For example, the client installation directory is <strong id="mrs_01_24033__b4350105913495">/opt/client</strong>.</p>
|
|
</div></div>
</p></li><li id="mrs_01_24033__li6424125918379"><a name="mrs_01_24033__li6424125918379"></a><a name="li6424125918379"></a><span>Log in to the node where the client is installed as user <strong id="mrs_01_24033__b1337718421528">root</strong> and run the following command:</span><p><p id="mrs_01_24033__p6725162313386"><strong id="mrs_01_24033__b174082193119">cd /opt/client</strong></p>
|
|
</p></li><li id="mrs_01_24033__li7810145113516"><span>Run the following commands to load environment variables:</span><p><p id="mrs_01_24033__p168151820193616"><strong id="mrs_01_24033__b1338342213512">source bigdata_env</strong></p>
<p id="mrs_01_24033__p42231120113619"><strong id="mrs_01_24033__b12223132010362">source Hudi/component_env</strong></p>
<p id="mrs_01_24033__p1022311204363"><strong id="mrs_01_24033__b15938133716363">kinit </strong><em id="mrs_01_24033__i848712529462">Created user</em></p>
<div class="note" id="mrs_01_24033__note4405123718714"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="mrs_01_24033__ul187091039133214"><li id="mrs_01_24033__li1771073963219">You need to change the password of the created user, and then run the <strong id="mrs_01_24033__b9803240328470">kinit</strong> command to log in to the system again.</li><li id="mrs_01_24033__li187101839113218">In normal mode (Kerberos authentication disabled), you do not need to run the <strong id="mrs_01_24033__b1644312378321">kinit</strong> command.</li></ul>
</div></div>
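<p>For example, if the user created in the prerequisites is named <em>hudiuser</em> (a placeholder; replace it with your actual user name), the complete command sequence is as follows:</p>
<pre class="screen">cd /opt/client
source bigdata_env
source Hudi/component_env
kinit hudiuser</pre>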
</p></li><li id="mrs_01_24033__li654313073616"><a name="mrs_01_24033__li654313073616"></a><a name="li654313073616"></a><span>Use <strong id="mrs_01_24033__b1552914518302">spark-shell --master yarn-client</strong> to import Hudi packages to generate test data:</span><p><pre class="screen" id="mrs_01_24033__screen1572515110358">// Import required packages.
|
|
import org.apache.hudi.QuickstartUtils._
|
|
import scala.collection.JavaConversions._
|
|
import org.apache.spark.sql.SaveMode._
|
|
import org.apache.hudi.DataSourceReadOptions._
|
|
import org.apache.hudi.DataSourceWriteOptions._
|
|
import org.apache.hudi.config.HoodieWriteConfig._
|
|
// Define the table name and storage path to generate test data.
|
|
val tableName = "hudi_cow_table"
|
|
val basePath = "hdfs://hacluster/tmp/hudi_cow_table"
|
|
val dataGen = new DataGenerator
|
|
val inserts = convertToStringList(dataGen.generateInserts(10))
|
|
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))</pre>
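<p>As an optional sanity check (not part of the original procedure), you can preview the generated records in spark-shell before writing them; the field names come from the Hudi quickstart data generator:</p>
<pre class="screen">// Optional: inspect the schema and a few generated trip records.
df.printSchema()
df.show(3, false)</pre>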
</p></li><li id="mrs_01_24033__li1916195433012"><span>Write data to the Hudi table in overwrite mode.</span><p><pre class="screen" id="mrs_01_24033__screen8865155113014">df.write.format("org.apache.hudi").
|
|
options(getQuickstartWriteConfigs).
|
|
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
|
|
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
|
|
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
|
|
option(TABLE_NAME, tableName).
|
|
mode(Overwrite).
|
|
save(basePath)</pre>
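<p>As an optional check (an illustrative addition, with <strong>writtenDF</strong> as a placeholder name), you can read the dataset back and confirm that the 10 generated records were written:</p>
<pre class="screen">// Optional: read the written table back and count the records (expected: 10).
val writtenDF = spark.read.format("org.apache.hudi").load(basePath + "/*/*/*/*")
writtenDF.count()</pre>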
</p></li><li id="mrs_01_24033__li208375443617"><span>Query the Hudi table.</span><p><p id="mrs_01_24033__p479241816364">Register a temporary table and query the table.</p>
<pre class="screen" id="mrs_01_24033__screen161561083120">val roViewDF = spark.
|
|
read.
|
|
format("org.apache.hudi").
|
|
load(basePath + "/*/*/*/*")
|
|
roViewDF.createOrReplaceTempView("hudi_ro_table")
|
|
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where fare > 20.0").show()</pre>
</p></li><li id="mrs_01_24033__li53341630153711"><span>Generate new data and update the Hudi table in append mode.</span><p><pre class="screen" id="mrs_01_24033__screen1676171033113">val updates = convertToStringList(dataGen.generateUpdates(10))
|
|
val df = spark.read.json(spark.sparkContext.parallelize(updates, 1))
|
|
df.write.format("org.apache.hudi").
|
|
options(getQuickstartWriteConfigs).
|
|
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
|
|
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
|
|
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
|
|
option(TABLE_NAME, tableName).
|
|
mode(Append).
|
|
save(basePath)</pre>
</p></li><li id="mrs_01_24033__li1220334433810"><span>Query incremental data in the Hudi table.</span><p><ul id="mrs_01_24033__ul249493483710"><li id="mrs_01_24033__li104945347374">Reload data.<pre class="screen" id="mrs_01_24033__screen13861820163117">spark.
|
|
read.
|
|
format("org.apache.hudi").
|
|
load(basePath + "/*/*/*/*").
|
|
createOrReplaceTempView("hudi_ro_table")</pre>
</li><li id="mrs_01_24033__li7475939123720">Perform the incremental query.<pre class="screen" id="mrs_01_24033__screen1810923013317">val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
|
|
val beginTime = commits(commits.length - 2)
|
|
val incViewDF = spark.
|
|
read.
|
|
format("org.apache.hudi").
|
|
option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
|
|
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
|
|
load(basePath);
|
|
incViewDF.registerTempTable("hudi_incr_table")
|
|
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()</pre>
</li></ul>
</p></li><li id="mrs_01_24033__li18176123364020"><span>Perform the point-in-time query.</span><p><pre class="screen" id="mrs_01_24033__screen1471853933114">val beginTime = "000"
|
|
val endTime = commits(commits.length - 2)
|
|
val incViewDF = spark.read.format("org.apache.hudi").
|
|
option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
|
|
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
|
|
option(END_INSTANTTIME_OPT_KEY, endTime).
|
|
load(basePath);
|
|
incViewDF.registerTempTable("hudi_incr_table")
|
|
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()</pre>
</p></li><li id="mrs_01_24033__li351416371594"><span>Delete data.</span><p><ul id="mrs_01_24033__ul107433714381"><li id="mrs_01_24033__li37493743814">Prepare the data to be deleted.<pre class="screen" id="mrs_01_24033__screen22288495318">val df = spark.sql("select uuid, partitionpath from hudi_ro_table limit 2")
|
|
val deletes = dataGen.generateDeletes(df.collectAsList())</pre>
</li><li id="mrs_01_24033__li4908144520387">Execute the deletion.<pre class="screen" id="mrs_01_24033__screen9550125763111">val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2));
|
|
df.write.format("org.apache.hudi").
|
|
options(getQuickstartWriteConfigs).
|
|
option(OPERATION_OPT_KEY,"delete").
|
|
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
|
|
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
|
|
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
|
|
option(TABLE_NAME, tableName).
|
|
mode(Append).
|
|
save(basePath);</pre>
</li><li id="mrs_01_24033__li1849311012399">Query data again.<pre class="screen" id="mrs_01_24033__screen203246683211">val roViewDFAfterDelete = spark.
|
|
read.
|
|
format("org.apache.hudi").
|
|
load(basePath + "/*/*/*/*")
|
|
roViewDFAfterDelete.createOrReplaceTempView("hudi_ro_table")
|
|
spark.sql("select uuid, partitionPath from hudi_ro_table").show()</pre>
</li></ul>
</p></li></ol>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24025.html">Using Hudi</a></div>
</div>
</div>