Su, Xiaomeng fdd43c552e dli_umn_20240808

Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>

2024-08-09 11:00:57 +00:00

How Do I Troubleshoot Slow SQL Jobs?

If a SQL job runs slowly, perform the following steps to identify the cause and fix it:

Possible Cause 1: Full GC

Check whether the problem is caused by Full GC.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
Locate the archived log folder in the OBS directory:
- Spark SQL jobs:
  Locate the log folder whose name contains driver or container_xxx_000001.
- Spark Jar jobs:
  The archive log folder of a Spark Jar job starts with batch.
Go to the archive log file directory and download the gc.log.* log file.
Open the downloaded gc.log.* file and search for the keyword Full GC. Check whether Full GC entries are recorded repeatedly with closely spaced timestamps.
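
The manual check in the last step can be approximated with a short script. The following is a minimal sketch, not part of DLI. It assumes that each Full GC line in gc.log.* carries a JVM uptime stamp in seconds (for example `12.345: [Full GC ...`). The function names and the thresholds `min_events` and `max_gap_s` are hypothetical and should be tuned to your logs.

```python
import re
from typing import Iterable, List

# Assumption: Full GC lines contain an uptime stamp in seconds followed by
# ": [Full GC", e.g. "12.345: [Full GC (Ergonomics) ...".
# Other JVM log formats need a different pattern.
FULL_GC = re.compile(r"(\d+\.\d+): \[Full GC")


def full_gc_times(lines: Iterable[str]) -> List[float]:
    """Return the uptime stamps (in seconds) of all Full GC entries."""
    times = []
    for line in lines:
        match = FULL_GC.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def looks_like_gc_thrash(times: List[float],
                         min_events: int = 5,
                         max_gap_s: float = 10.0) -> bool:
    """Heuristic: at least min_events Full GCs, each within max_gap_s
    of the previous one. Both thresholds are illustrative only."""
    if len(times) < min_events:
        return False
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return max(gaps) <= max_gap_s
```

To use it, read the downloaded file line by line, pass the lines to `full_gc_times`, and pass the result to `looks_like_gc_thrash`. A `True` result only suggests that Full GC is worth investigating with the steps below.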

Cause locating and solution

Cause 1: There are too many small files in a table.

Log in to the DLI console and go to the SQL editor page. On the SQL Editor page, select the queue and database of the faulty job.
Run the following statement to count the files in the table. Replace table_name with the name of the table used by the job.
```
-- Replace table_name with the name of the table to check.
SELECT count(DISTINCT fn)
FROM (SELECT input_file_name() AS fn FROM table_name) a;
```
If there are too many small files, merge them as described in How Do I Merge Small Files?

Cause 2: There is a broadcast table.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job, view the job details, and obtain the job ID.
In the Operation column of the job, click Spark UI.
On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
View the DAG of the job and check whether it contains a BroadcastNestedLoopJoin node.
Figure 1 DAG
If a BroadcastNestedLoopJoin node exists, fix the problem by referring to Why Does a SQL Job That Has Join Operations Stay in the Running State?

Possible Cause 2: Data Skew

Check whether the problem is caused by data skew.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job, view the job details, and obtain the job ID.
In the Operation column of the job, click Spark UI.
On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
View the Launch Time and Duration of each task.
Click Duration to sort the tasks, and check whether a single long-running task is prolonging the overall job.
As shown in Figure 2, when data skew occurs, the shuffle read size of one task is much larger than that of the other tasks.
Figure 2 Data skew
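
The comparison described above can be expressed as a simple ratio. The following is a minimal sketch that assumes you have copied the per-task shuffle read sizes from the Spark UI task table into a list. The function name `skew_ratio` and the sample numbers are hypothetical, and no particular threshold is defined by DLI.

```python
from statistics import median
from typing import Sequence


def skew_ratio(shuffle_read_bytes: Sequence[float]) -> float:
    """Largest task's shuffle read size divided by the median task's size.

    A ratio close to 1 means the tasks are balanced. A large ratio means
    that one task reads far more data than a typical task.
    """
    typical = median(shuffle_read_bytes)
    largest = max(shuffle_read_bytes)
    if typical == 0:
        # Avoid division by zero when most tasks read nothing.
        return float("inf") if largest > 0 else 1.0
    return largest / typical


# Hypothetical sizes in bytes: four balanced tasks and one heavy task.
sizes = [100, 100, 100, 100, 5000]
ratio = skew_ratio(sizes)
```

The median is used instead of the mean so that the heavy task itself does not hide the imbalance.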

Cause locating and solution

Shuffle data skew is caused by an unbalanced distribution of key values in a join.

Perform group by and count on the join key to see how many rows each key value has. The following is an example:
Table lefttbl is joined with table righttbl, and num in lefttbl is the join key. Perform group by and count on lefttbl.num.
```
SELECT * FROM lefttbl a LEFT join righttbl b on a.num = b.int2;
SELECT count(1) as count,num from lefttbl  group by lefttbl.num ORDER BY count desc;
```
To spread a skewed key across more tasks, append a random number to each key value, for example by using concat together with cast(round(rand() * 999999999) as string).
If the skew is serious and random numbers cannot be generated, see How Do I Do When Data Skew Occurs During the Execution of a SQL Job?
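
Why appending a random number helps can be shown with a small simulation. The following is a minimal sketch and not DLI or Spark code. It assumes that rows are assigned to partitions by a hash of the key. The key names, the bucket count of 100, and the partition count of 10 are all hypothetical. In a real join, the other table must also be expanded to cover the same salt values, which this sketch does not show.

```python
import random
import zlib
from collections import Counter
from typing import Iterable


def partition_of(key: str, num_partitions: int) -> int:
    # CRC32 is used as a stable stand-in for a shuffle hash, because
    # Python's built-in hash() for strings changes between processes.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_size(keys: Iterable[str], num_partitions: int) -> int:
    """Number of rows in the most heavily loaded partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return max(counts.values())


def salt(key: str, buckets: int, rng: random.Random) -> str:
    # Same idea as concatenating the key with a random number in SQL.
    return f"{key}_{rng.randrange(buckets)}"


rng = random.Random(0)  # fixed seed so that the run is repeatable
# Hypothetical data: one hot key with 9000 rows plus 1000 distinct keys.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

before = max_partition_size(keys, 10)
after = max_partition_size([salt(k, 100, rng) for k in keys], 10)
```

Without salting, all rows of the hot key land in one partition, so `before` is at least 9000. After salting, the hot key becomes up to 100 different keys that hash to different partitions, so `after` is smaller.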
