Spark SQL supports rule-based optimization by default. However, the rule-based optimization cannot ensure that Spark selects the optimal query plan. Cost-Based Optimizer (CBO) is a technology that intelligently selects query plans for SQL statements. After CBO is enabled, the CBO optimizer performs a series of estimations based on the table and column statistics to select the optimal query plan.
Perform the following steps to enable CBO:
SQL commands are as follows (to be chosen as required):
ANALYZE TABLE src COMPUTE STATISTICS
This command generates sizeInBytes and rowCount.
When you use the ANALYZE statement to collect statistics, sizes of tables not from HDFS cannot be calculated.
ANALYZE TABLE src COMPUTE STATISTICS NOSCAN
This command generates only sizeInBytes. Compared with the originally generated sizeInBytes and rowCount if the sizeInBytes remains unchanged, rowCount (if any) reserves. Otherwise, rowCount is cleared.
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS a, b, c
This command generates column statistics and updates table statistics for consistency. Statistics of complicated data types (such as Seq and Map) and HiveStringType cannot be generated.
This command displays xxx bytes and xxx rows in Statistics to indicate table-level statistics. You can also run the following command to display column statistics:
DESC FORMATTED src a
Limitation: The current statistics collection does not support statistics for partition levels for partitioned tables.
Parameter |
Description |
Default Value |
---|---|---|
spark.sql.cbo.enabled |
The switch to enable or disable CBO.
To enable this function, ensure that statistics of related tables and columns are generated. |
false |
spark.sql.cbo.joinReorder.enabled |
Specifies whether to automatically adjust the sequence of consecutive inner joins by using CBO.
To enable this function, ensure that statistics of related tables and columns are generated and CBO is enabled. |
false |
spark.sql.cbo.joinReorder.dp.threshold |
Specifies the threshold of the number of tables that the sequence of consecutive inner joins is automatically adjusted by CBO. If the threshold is exceeded, the sequence of joins is not adjusted. |
12 |