Theses hints specify redistribution keys containing skew data and skew values, and are used to optimize redistribution involving Join or HashAgg.
1 | skew(table (column) [(value)]) |
1 | skew((join_rel) (column) [(value)]) |
Example:
Each skew hint describes the skew information of one table relationship. To describe the skews of multiple table relationships in a query, specify multiple skew hints.
Skew hints have the following formats:
Description: The v1 value in the c1 column of the t table relationship causes skew in query execution.
Description: Values including v1, v2, and v3 in the c1 column of the t table relationship cause skew in query execution.
Description: The v1 value in the c1 column and the v2 value in the c2 column of the t table relationship cause skew in query execution.
Description: Values including v1, v3, and v5 in the c1 column and values including v2, v4, and v6 in the c2 column of the t table relationship cause skew in query execution.
In the last format, parentheses for skew value groups can be omitted, for example, skew(t (c1 c2) (v1 v2 v3 v4 v5 v6 ...)). In a skew hint, either use parentheses for all skew value groups or for none of them.
Otherwise, a syntax error will be generated. For example, skew(t (c1 c2) (v1 v2 v3 v4 (v5 v6) ...)) will generate an error.
If data skew does not occur in base tables but in an intermediate result during query execution, specify skew hints of the intermediate result to solve the skew. The format is skew((t1 t2) (c1) (v1)).
Description: Data skew occurs after the table relationships t1 and t2 are joined. The c1 column of the t1 table contains skew data and its skew value is v1.
c1 can exist only in a table relationship of join_rel. If there is another column having the same name, use aliases to avoid ambiguity.
Specify single-table skew.
For example, the original query is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | explain with customer_total_return as (select sr_customer_sk as ctr_customer_sk ,sr_store_sk as ctr_store_sk ,sum(SR_FEE) as ctr_total_return from store_returns ,date_dim where sr_returned_date_sk = d_date_sk and d_year =2000 group by sr_customer_sk ,sr_store_sk) select c_customer_id from customer_total_return ctr1 ,store ,customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_store_sk = ctr2.ctr_store_sk) and s_store_sk = ctr1.ctr_store_sk and s_state = 'NM' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id limit 100; |
Specify the hints of HashAgg in the inner with clause and of the outer Hash Join. The query containing hints is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | explain with customer_total_return as (select /*+ skew(store_returns(sr_store_sk sr_customer_sk)) */sr_customer_sk as ctr_customer_sk ,sr_store_sk as ctr_store_sk ,sum(SR_FEE) as ctr_total_return from store_returns ,date_dim where sr_returned_date_sk = d_date_sk and d_year =2000 group by sr_customer_sk ,sr_store_sk) select /*+ skew(ctr1(ctr_customer_sk)(11))*/ c_customer_id from customer_total_return ctr1 ,store ,customer where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 from customer_total_return ctr2 where ctr1.ctr_store_sk = ctr2.ctr_store_sk) and s_store_sk = ctr1.ctr_store_sk and s_state = 'NM' and ctr1.ctr_customer_sk = c_customer_sk order by c_customer_id limit 100; |
The hints indicate that the group by in the inner with clause contains skew data during redistribution by HashAgg, corresponding to the original Hash Agg operators 10 and 21; and that the ctr_customer_sk column in the outer ctr1 table contains skew data during redistribution by Hash Join, corresponding to operator 6 in the original plan. The optimized plan is as follows:
To solve data skew in the redistribution, Hash Agg is changed to double-level Agg operators and the redistribution operators used by Hash Join are changed in the optimized plan.
For example, the original query and its plan are as follows:
1 | explain select count(*) from store_sales_1 group by round(ss_list_price); |
Columns in hints do not support expressions. To specify hints, rewrite the query as several subqueries. The rewritten query and its plan are as follows:
1 2 3 4 5 | explain select count(*) from (select round(ss_list_price),ss_hdemo_sk from store_sales_1)tmp(a,ss_hdemo_sk) group by a; |
Ensure that the service logic is not changed during the rewriting.
Specify hints in the rewritten query as follows:
1 2 3 4 5 | explain select /*+ skew(tmp(a)) */ count(*) from (select round(ss_list_price),ss_hdemo_sk from store_sales_1)tmp(a,ss_hdemo_sk) group by a; |
The plan shows that after Hash Agg is changed to double-layer Agg operators, redistributed data is greatly reduced and redistribution time shortened.
You can specify hints in columns in a subquery, for example:
1 2 3 4 5 | explain select /*+ skew(tmp(b)) */ count(*) from (select round(ss_list_price) b,ss_hdemo_sk from store_sales_1)tmp(a,ss_hdemo_sk) group by a; |