Reviewed-by: Kacur, Michal <michal.kacur@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_24164"></a><a name="mrs_01_24164"></a>
<h1 class="topictitle1">Metadata Table</h1>
<div id="body32001227"><ul id="mrs_01_24164__en-us_topic_0000001173631260_ul184515408169"><li id="mrs_01_24164__en-us_topic_0000001173631260_li1545204018164"><strong id="mrs_01_24164__en-us_topic_0000001173631260_b114001533440">Introduction</strong><p id="mrs_01_24164__en-us_topic_0000001173631260_p1399217194152">A metadata table is a special Hudi table that is hidden from users. It stores the metadata of a common Hudi table.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p699210190158">Each metadata table is contained in a common Hudi table and maps one-to-one to that table.</p>
</li></ul>
<ul id="mrs_01_24164__en-us_topic_0000001173631260_ul1677104217165"><li id="mrs_01_24164__en-us_topic_0000001173631260_li127713421169"><strong id="mrs_01_24164__en-us_topic_0000001173631260_b760743185019">Functions</strong><p id="mrs_01_24164__en-us_topic_0000001173631260_p8615165819149">Listing the partition files of a very large table in HDFS takes a large number of RPC requests, which reduces HDFS throughput and degrades performance. The problem is more serious on object storage such as OBS. A query engine, however, must list these files before it can run a query.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p176954331513">Generally, the partition information of a partitioned table is stored in Hive MetaStore. Once the number of partitions in the table reaches a certain level, query engine performance deteriorates significantly when the engine queries the partition information of that table.</p>
</li></ul>
<ul id="mrs_01_24164__en-us_topic_0000001173631260_ul1178918438163"><li id="mrs_01_24164__en-us_topic_0000001173631260_li207891843201620"><strong id="mrs_01_24164__en-us_topic_0000001173631260_b199598171178">Mechanism</strong><p id="mrs_01_24164__en-us_topic_0000001173631260_p4801814161">A metadata table stores the partition information of the current Hudi table, together with the file information of each partition directory, as metadata in a special Hudi table. When a query engine lists the partition files of the table, it only needs to access the metadata table. Because the volume of metadata is small, this greatly reduces the RPC pressure on HDFS during queries.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p36157585149">A metadata table is implemented as a Hudi MOR table. Therefore, it can be compacted, cleaned up, and incrementally updated. Unlike similar implementations in other projects, the file listing information is indexed in HFiles, which provide fast point lookups for obtaining partition file listings.</p>
</li></ul>
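<p>The mechanism above can be pictured with a toy model. The sketch below is not Hudi code: plain Python dictionaries stand in for both the storage layer and the metadata table, and the class and method names are hypothetical. It only illustrates why looking up a maintained index avoids the per-partition listing requests.</p>

```python
class ToyHudiTable:
    """Toy model only: plain dicts stand in for storage and the metadata table."""

    def __init__(self):
        self._storage = {}         # partition path to file names on HDFS or OBS
        self._metadata_index = {}  # the same mapping, kept as table metadata
        self.list_requests = 0     # simulated directory-listing requests

    def commit(self, partition, file_name):
        """A write adds the file and records it in the metadata index."""
        self._storage.setdefault(partition, []).append(file_name)
        self._metadata_index.setdefault(partition, []).append(file_name)

    def _list_directory(self, partition):
        self.list_requests += 1    # one request per partition directory
        return list(self._storage[partition])

    def files_by_listing(self):
        """Without the metadata table: list every partition directory."""
        return {p: self._list_directory(p) for p in sorted(self._storage)}

    def files_by_metadata(self):
        """With the metadata table: key lookups, no directory listing."""
        return {p: list(self._metadata_index[p]) for p in sorted(self._metadata_index)}


if __name__ == "__main__":
    table = ToyHudiTable()
    table.commit("dt=2021-01-01", "a.parquet")
    table.commit("dt=2021-01-01", "b.parquet")
    table.commit("dt=2021-01-02", "c.parquet")
    print(table.files_by_metadata(), table.list_requests)
    print(table.files_by_listing(), table.list_requests)
```

<p>In this model both paths return the same file listing, but only the listing path issues one simulated request per partition. The real metadata table stores the index in HFiles inside a MOR table rather than in memory.</p>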
<ul id="mrs_01_24164__en-us_topic_0000001173631260_ul639424820198"><li id="mrs_01_24164__en-us_topic_0000001173631260_li11394148171916"><strong id="mrs_01_24164__en-us_topic_0000001173631260_b16922255525">How to Use</strong><p id="mrs_01_24164__en-us_topic_0000001173631260_p11344412102616">For Hive queries, run <strong id="mrs_01_24164__en-us_topic_0000001173631260_b4945115112612">set hoodie.metadata.enable=true</strong>.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p17514111942611">For Spark SQL queries, set <strong id="mrs_01_24164__en-us_topic_0000001173631260_b209184207347">--conf spark.hadoop.hoodie.metadata.enable</strong> to <strong id="mrs_01_24164__en-us_topic_0000001173631260_b158151574343">true</strong> when starting Spark SQL.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p17492137192018">When using Spark to write data, set <strong id="mrs_01_24164__en-us_topic_0000001173631260_b1888565253018">hoodie.metadata.enable</strong> in the <strong id="mrs_01_24164__en-us_topic_0000001173631260_b1356611173320">option</strong> parameter to <strong id="mrs_01_24164__en-us_topic_0000001173631260_b16524125993013">true</strong>.</p>
<p id="mrs_01_24164__en-us_topic_0000001173631260_p1647115702117">For details about other parameters, see <a href="mrs_01_24032.html">Hudi Configuration Reference</a> or the official Hudi website at <a href="http://hudi.apache.org/docs/configurations.html#metadata-config" target="_blank" rel="noopener noreferrer">http://hudi.apache.org/docs/configurations.html#metadata-config</a>.</p>
</li></ul>
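<p>The three settings above can be summarized in a small helper sketch. Only the key <strong>hoodie.metadata.enable</strong> and the <strong>spark.hadoop.</strong> prefix come from this topic; the function names are hypothetical and the code merely builds the strings and option map, it does not connect to Hive or Spark.</p>

```python
# Only the key "hoodie.metadata.enable" and the "spark.hadoop." prefix come
# from the text above; every function name here is hypothetical.
METADATA_KEY = "hoodie.metadata.enable"


def _flag(enabled):
    """Render a Python bool as the lowercase string the settings expect."""
    return "true" if enabled else "false"


def hive_set_statement(enabled=True):
    """Statement to run in a Hive session before querying."""
    return f"set {METADATA_KEY}={_flag(enabled)};"


def spark_sql_launch_args(enabled=True):
    """Extra arguments to append when starting Spark SQL."""
    return ["--conf", f"spark.hadoop.{METADATA_KEY}={_flag(enabled)}"]


def spark_write_options(other_options=None, enabled=True):
    """Option map for a Spark write; other_options holds the table's usual settings."""
    options = dict(other_options or {})
    options[METADATA_KEY] = _flag(enabled)
    return options


if __name__ == "__main__":
    print(hive_set_statement())
    print(" ".join(["spark-sql"] + spark_sql_launch_args()))
    print(spark_write_options())
```

<p>Assuming a PySpark DataFrame named <strong>df</strong> and an existing map of write settings, the returned map could be passed to the writer with <strong>df.write.format("hudi").options(**spark_write_options(settings))</strong>. Building the map in one place makes it harder to forget the setting on an individual write.</p>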
<ul id="mrs_01_24164__en-us_topic_0000001173631260_ul17181183962119"><li id="mrs_01_24164__en-us_topic_0000001173631260_li15181173911212"><strong id="mrs_01_24164__en-us_topic_0000001173631260_b9885190183514">Performance Improvement</strong><p id="mrs_01_24164__en-us_topic_0000001173631260_p492919178244">In a test on a large table with 250,000 partition files, the metadata table delivered a two- to three-fold speedup over parallelized listing done by Spark.</p>
<div class="caution" id="mrs_01_24164__en-us_topic_0000001173631260_note1644052183913"><span class="cautiontitle"><img src="public_sys-resources/caution_3.0-en-us.png"> </span><div class="cautionbody"><ol id="mrs_01_24164__en-us_topic_0000001173631260_ol171501662393"><li id="mrs_01_24164__en-us_topic_0000001173631260_li1315036193915"><a name="mrs_01_24164__en-us_topic_0000001173631260_li1315036193915"></a><a name="en-us_topic_0000001173631260_li1315036193915"></a>Do not manually operate a metadata table. Otherwise, data security may be affected.</li><li id="mrs_01_24164__en-us_topic_0000001173631260_li515019693913">To use the metadata table, you must enable metadata for every write operation to ensure data integrity.</li><li id="mrs_01_24164__en-us_topic_0000001173631260_li31507612398">When compaction or rollback is performed on Hudi 0.8 tables, the changes cannot be synchronized to the metadata table.</li><li id="mrs_01_24164__en-us_topic_0000001173631260_li2015016113912">When metadata is enabled during a clean operation, the metadata table can be updated.</li><li id="mrs_01_24164__en-us_topic_0000001173631260_li21501614398">When the number of commits reaches a specified value, compaction, clean, and archive operations are triggered automatically on the metadata table. Therefore, the manual operations mentioned in <a href="#mrs_01_24164__en-us_topic_0000001173631260_li1315036193915">1</a> are unnecessary.</li></ol>
</div></div>
</li></ul>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24038.html">Data Management and Maintenance</a></div>
</div>
</div>