Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_24491"></a><a name="mrs_01_24491"></a>
<h1 class="topictitle1">Why Can't I Query Newly Inserted Data in an ORC Hive Table Using Spark SQL?</h1>
<div id="body0000001252294421"><div class="section" id="mrs_01_24491__section1938124410304"><h4 class="sectiontitle">Question</h4><p id="mrs_01_24491__p7179131610522">Why can't I query newly inserted data in an ORC Hive table using Spark SQL? This problem occurs in the following scenarios:</p>
<ul id="mrs_01_24491__ul1170492383113"><li id="mrs_01_24491__li1370416233311">For both partitioned and non-partitioned tables, data inserted on the Hive client cannot be queried using Spark SQL.</li><li id="mrs_01_24491__li1193410253316">After data is inserted into a partitioned table using Spark SQL, the newly inserted data cannot be queried using Spark SQL if the partition information remains unchanged.</li></ul>
</div>
<div class="section" id="mrs_01_24491__section18523211173115"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_24491__p1537884213523">Spark caches ORC metadata to improve performance. When the ORC table is updated by Hive or by other means, the cached metadata is not updated, so Spark SQL cannot query the newly inserted data.</p>
<p id="mrs_01_24491__p1237874265213">For a partitioned ORC Hive table, the cached metadata is not updated if the partition information remains unchanged after data is inserted. As a result, Spark SQL cannot query the newly inserted data.</p>
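<p>The scenario can be illustrated with the following sketch. The table name <em>orc_demo</em> and its two columns are hypothetical, and the exact result depends on whether Spark has already cached the metadata of the table in the current session.</p>
<pre>
-- Spark SQL session: the first query causes Spark to cache the ORC metadata.
SELECT COUNT(*) FROM orc_demo;

-- Hive client: new data is inserted into the same table.
INSERT INTO orc_demo VALUES (3, 'c');

-- Same Spark SQL session: the query may still use the cached metadata
-- and therefore may not return the newly inserted row.
SELECT COUNT(*) FROM orc_demo;
</pre>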
<p id="mrs_01_24491__p14379104214523"><strong id="mrs_01_24491__b16594145493115">Solution</strong></p>
<ol id="mrs_01_24491__ol4952195717319"><li id="mrs_01_24491__li19952857133118">Refresh the cached metadata of the table before running the Spark SQL query:<p id="mrs_01_24491__p1237974212527"><a name="mrs_01_24491__li19952857133118"></a><a name="li19952857133118"></a><strong id="mrs_01_24491__b156532591322">REFRESH TABLE</strong><em id="mrs_01_24491__i1224716023317"> t</em><em id="mrs_01_24491__i71017563329">able_name</em><strong id="mrs_01_24491__b134551355163216">;</strong></p>
<p id="mrs_01_24491__p14379174245216"><i><span class="varname" id="mrs_01_24491__varname740345910115230">table_name</span></i> is the name of the table whose metadata is to be refreshed. The table must exist; otherwise, an error is reported.</p>
<p id="mrs_01_24491__p15379164218528">Queries executed after the refresh return the newly inserted data.</p>
</li><li id="mrs_01_24491__li1083941203217">Alternatively, run the following command in Spark SQL to disable the built-in ORC optimization of Spark, so that Spark reads Hive ORC tables through the Hive SerDe instead:<p id="mrs_01_24491__p237914218524"><a name="mrs_01_24491__li1083941203217"></a><a name="li1083941203217"></a><strong id="mrs_01_24491__b13813143013217">set spark.sql.hive.convertMetastoreOrc=false;</strong></p>
</li></ol>
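<p>As a minimal sketch of the two options above, assuming a hypothetical table named <em>orc_demo</em>:</p>
<pre>
-- Option 1: refresh the cached metadata, then query the table.
REFRESH TABLE orc_demo;
SELECT * FROM orc_demo;

-- Option 2: disable the conversion for the current session, then query the table.
SET spark.sql.hive.convertMetastoreOrc=false;
SELECT * FROM orc_demo;
</pre>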
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2022.html">Spark SQL and DataFrame</a></div>
</div>
</div>