Why cannot I query newly inserted data in a parquet Hive table using SparkSQL? This problem occurs in the following scenarios:
To improve Spark performance, parquet metadata is cached. When the parquet table is updated by Hive or another means, the cached metadata remains unchanged, resulting in SparkSQL failing to query the newly inserted data.
For a parquet Hive partition table, if the partition information remains unchanged after data is inserted, the cached metadata is not updated. As a result, the newly inserted data cannot be queried by SparkSQL.
To solve the query problem, update metadata before starting a Spark SQL query.
REFRESH TABLE table_name;
table_name indicates the name of the table to be updated. The table must exist. Otherwise, an error is reported.
When the query statement is executed, the latest inserted data can be obtained.
For details, visit https://spark.apache.org/docs/3.1.1/sql-programming-guide.html#metadata-refreshing.