When an application runs a SQL statement to operate the database, a large number of subqueries are used because they are more clear than table join. Especially in complicated query statements, subqueries have more complete and independent semantics, which makes SQL statements clearer and easy to understand. Therefore, subqueries are widely used.
In GaussDB(DWS), subqueries can also be called sublinks based on the location of subqueries in SQL statements.
The sublinks commonly used in OLAP and HTAP are exist_sublink and any_sublink. The sublinks are pulled up by the optimization engine of GaussDB(DWS). Because of the flexible use of subqueries in SQL statements, complex subqueries may affect query performance. Subqueries are classified into non-correlated subqueries and correlated subqueries.
The execution of a subquery is independent from any attribute of outer queries. In this way, a subquery can be executed before outer queries.
Example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c2 IN (2,3,4) ); QUERY PLAN --------------------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Hash Right Semi Join Hash Cond: (t2.c2 = t1.c1) -> Streaming(type: REDISTRIBUTE) Spawn on: All datanodes -> Seq Scan on t2 Filter: (c2 = ANY ('{2,3,4}'::integer[])) -> Hash -> Seq Scan on t1 (10 rows) |
The execution of a subquery depends on some attributes of outer queries which are used as AND conditions of the subquery. In the following example, t1.c1 in the t2.c1 = t1.c1 condition is a dependent attribute. Such a subquery depends on outer queries and needs to be executed once for each outer query.
Example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c1 = t1.c1 AND t2.c2 in (2,3,4) ); QUERY PLAN ----------------------------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Seq Scan on t1 Filter: (SubPlan 1) SubPlan 1 -> Result Filter: (t2.c1 = t1.c1) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on t2 Filter: (c2 = ANY ('{2,3,4}'::integer[])) (12 rows) |
A subquery is pulled up to join with tables in outer queries, preventing the subquery from being converted into the combination of a subplan and broadcast. You can run the EXPLAIN statement to check whether a subquery is converted into the combination of a subplan and broadcast.
Example:
The WHERE clause must contain a column in the outer query. Other parts of the subquery cannot contain the column. Other restrictions are as follows:
The WHERE condition of the subquery must contain a column from the outer query. Equivalence comparison must be performed between this column and related columns in tables of the subquery. These conditions must be connected using AND. Other parts of the subquery cannot contain the column. Other restrictions are as follows:
1 2 3 | select * from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 ); |
The following subquery cannot be pulled up because the subquery has no aggregation function.
1 2 3 | select * from t1 where c1 >( select t2.c1 from t2 where t2.c1=t1.c1 ); |
The following subquery cannot be pulled up because the subquery has two output columns:
1 2 3 | select * from t1 where (c1,c2) >( select max(t2.c1),min(t2.c2) from t2 where t2.c1=t1.c1 ); |
1 2 3 4 5 | select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 )); |
If another condition is added to the subquery in the previous example, the subquery cannot be pulled up because the subquery references to the column in the outer query. Example:
1 2 3 4 5 6 | select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 and t3.c1>t2.c2 )); |
If the WHERE condition contains a EXIST-related sublink connected by OR,
for example,
1 2 3 | select a, c from t1 where t1.a = (select avg(a) from t3 where t1.b = t3.b) or exists (select * from t4 where t1.c = t4.c); |
the process of pulling up such a sublink is as follows:
1 2 3 | select a, c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) where t3.b is not null or exists (select * from t4 where t1.c = t4.c); |
1 2 3 | select a, c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) left join (select t4.c from t4 group by t4.c) where t3.b is not null or t4.c is not null; |
Except the sublinks described above, all the other sublinks cannot be pulled up. In this case, a join subquery is planned as the combination of a subplan and broadcast. As a result, if tables in the subquery have a large amount of data, query performance may be poor.
If a correlated subquery joins with two tables in outer queries, the subquery cannot be pulled up. You need to change the parent query into a WITH clause and then perform the join.
Example:
1 2 | select distinct t1.a, t2.a from t1 left join t2 on t1.a=t2.a and not exists (select a,b from test1 where test1.a=t1.a and test1.b=t2.a); |
The parent query is changed into:
1 2 3 4 5 6 7 8 | with temp as ( select * from (select t1.a as a, t2.a as b from t1 left join t2 on t1.a=t2.a) ) select distinct a,b from temp where not exists (select a,b from test1 where temp.a=test1.a and temp.b=test1.b); |
Example:
1 2 3 4 | explain (costs off) select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2 from t1 where t1.c2 > 10; |
The execution plan is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | explain (costs off) select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2 from t1 where t1.c2 > 10; QUERY PLAN ------------------------------------------------------ Streaming (type: GATHER) Node/s: All datanodes -> Seq Scan on t1 Filter: (c2 > 10) SubPlan 1 -> Result Filter: (t1.c1 = t2.c1) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on t2 (11 rows) |
The correlated subquery is displayed in the target list (query return list). Values need to be returned even if the condition t1.c1=t2.c1 is not met. Therefore, use left outer join to join T1 and T2 so that SSQ can return padding values when the condition t1.c1=t2.c1 is not met.
ScalarSubQuery (SSQ) and Correlated-ScalarSubQuery (CSSQ) are described as follows:
The preceding SQL statement can be changed into:
1 2 3 4 5 6 7 | with ssq as ( select t2.c2 from t2 ) select ssq.c2, t1.c2 from t1 left join ssq on t1.c1 = ssq.c2 where t1.c2 > 10; |
The execution plan after the change is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 | QUERY PLAN ------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Hash Right Join Hash Cond: (t2.c2 = t1.c1) -> Streaming(type: REDISTRIBUTE) Spawn on: All datanodes -> Seq Scan on t2 -> Hash -> Seq Scan on t1 Filter: (c2 > 10) (10 rows) |
In the preceding example, the SSQ is pulled up to right join, preventing poor performance caused by the combination of a subplan and broadcast when the table (T2) in the subquery is too large.
Example:
1 2 3 | select (select count(*) from t2 where t2.c1=t1.c1) cnt, t1.c1, t3.c1 from t1,t3 where t1.c1=t3.c1 order by cnt, t1.c1; |
The execution plan is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | QUERY PLAN ------------------------------------------------------------------ Streaming (type: GATHER) Node/s: All datanodes -> Sort Sort Key: ((SubPlan 1)), t1.c1 -> Hash Join Hash Cond: (t1.c1 = t3.c1) -> Seq Scan on t1 -> Hash -> Seq Scan on t3 SubPlan 1 -> Aggregate -> Result Filter: (t2.c1 = t1.c1) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on t2 (17 rows) |
The correlated subquery is displayed in the target list (query return list). Values need to be returned even if the condition t1.c1=t2.c1 is not met. Therefore, use left outer join to join T1 and T2 so that SSQ can return padding values when the condition t1.c1=t2.c1 is not met. However, COUNT is used to ensure that 0 is returned when the condition is note met. Therefore, case-when NULL then 0 else count(*) can be used.
The preceding SQL statement can be changed into:
1 2 3 4 5 6 7 8 9 10 11 | with ssq as ( select count(*) cnt, c1 from t2 group by c1 ) select case when ssq.cnt is null then 0 else ssq.cnt end cnt, t1.c1, t3.c1 from t1 left join ssq on ssq.c1 = t1.c1,t3 where t1.c1 = t3.c1 order by ssq.cnt, t1.c1; |
The execution plan after the change is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | QUERY PLAN ----------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Sort Sort Key: (count(*)), t1.c1 -> Hash Join Hash Cond: (t1.c1 = t3.c1) -> Hash Left Join Hash Cond: (t1.c1 = t2.c1) -> Seq Scan on t1 -> Hash -> HashAggregate Group By Key: t2.c1 -> Seq Scan on t2 -> Hash -> Seq Scan on t3 (15 rows) |
1 2 3 | select t1.c1, t1.c2 from t1 where t1.c1 = (select agg() from t2.c2 > t1.c2); |
Nonequivalent subqueries cannot be pulled up. You can perform join twice (one CorrelationKey and one rownum self-join) to rewrite the statement.
You can rewrite the statement in either of the following ways:
1 2 3 4 5 6 7 | select t1.c1, t1.c2 from t1, ( select t1.rowid, agg() aggref from t1,t2 where t1.c2 > t2.c2 group by t1.rowid ) dt /* derived table */ where t1.rowid = dt.rowid AND t1.c1 = dt.aggref; |
1 2 3 4 5 6 7 8 9 10 | WITH dt as ( select t1.rowid, agg() aggref from t1,t2 where t1.c2 > t2.c2 group by t1.rowid ) select t1.c1, t1.c2 from t1, derived_table where t1.rowid = derived_table.rowid AND t1.c1 = derived_table.aggref; |
1. Change the base table to a replication table and create an index on the filter column.
1 2 3 | create table master_table (a int); create table sub_table(a int, b int); select a from master_table group by a having a in (select a from sub_table); |
In this example, a correlated subquery is contained. To improve the query performance, you can change sub_table to a replication table and create an index on the a column.
2. Modify the SELECT statement, change the subquery to a JOIN relationship between the primary table and the parent query, or modify the subquery to improve the query performance. Ensure that the subquery to be used is semantically correct.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | explain (costs off)select * from master_table as t1 where t1.a in (select t2.a from sub_table as t2 where t1.a = t2.b); QUERY PLAN ---------------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Seq Scan on master_table t1 Filter: (SubPlan 1) SubPlan 1 -> Result Filter: (t1.a = t2.b) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on sub_table t2 (11 rows) |
In the preceding example, a subplan is used. To remove the subplan, you can modify the statement as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 | explain(costs off) select * from master_table as t1 where exists (select t2.a from sub_table as t2 where t1.a = t2.b and t1.a = t2.a); QUERY PLAN -------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Hash Semi Join Hash Cond: (t1.a = t2.b) -> Seq Scan on master_table t1 -> Hash -> Streaming(type: REDISTRIBUTE) Spawn on: All datanodes -> Seq Scan on sub_table t2 (9 rows) |
In this way, the subplan is replaced by the semi-join between the two tables, greatly improving the execution efficiency.