Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
How Do I Troubleshoot Slow SQL Jobs?
If the job runs slowly, perform the following steps to find the causes and rectify the fault:
Possible Cause 1: Full GC
Check whether the problem is caused by Full GC.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
- Obtain the folder of the archived logs in the OBS directory. The details are as follows:
- Go to the archive log file directory and download the gc.log.* log file.
- Open the downloaded gc.log.* log file and search for the keyword Full GC. Check whether Full GC entries are recorded repeatedly at consecutive timestamps.
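The search in the last step can also be scripted. The following is a minimal sketch, assuming only that each Full GC event is written on a line containing the text Full GC. The function name find_full_gc and the sample lines are hypothetical, and the real layout of gc.log.* depends on the JVM options in use.

```python
def find_full_gc(lines):
    """Return (line_number, text) for every log line that mentions 'Full GC'."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if "Full GC" in line
    ]


# Hypothetical sample lines, used only to illustrate the search.
sample_log = [
    "12.345: [GC (Allocation Failure) ...]",
    "13.001: [Full GC (Ergonomics) ...]",
    "13.870: [Full GC (Ergonomics) ...]",
    "14.512: [Full GC (Ergonomics) ...]",
]

hits = find_full_gc(sample_log)
print(len(hits), "Full GC entries found")
for number, text in hits:
    print(f"line {number}: {text}")
```

To scan a downloaded file, pass an open file handle instead of the sample list, for example `with open(path) as f: hits = find_full_gc(f)`. Many hits on adjacent lines suggest that Full GC is occurring repeatedly.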
Cause locating and solution
Cause 1: There are too many small files in a table.
- Log in to the DLI console and go to the SQL Editor page. Select the queue and database of the faulty job.
- Run the following statement to check the number of files in the table. Replace table_name with the name of the table to check.
SELECT count(DISTINCT fn) FROM (SELECT input_file_name() AS fn FROM table_name) a;
- If there are too many small files, rectify the fault by referring to How Do I Merge Small Files?.
Cause 2: There is a broadcast table.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click the icon in that row to view the job details and obtain the job ID.
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
- View the DAG of the job to check whether the BroadcastNestedLoopJoin node exists.
- If the BroadcastNestedLoopJoin node exists, refer to Why Does a SQL Job That Has Join Operations Stay in the Running State? to rectify the fault.
Possible Cause 2: Data Skew
Check whether the problem is caused by data skew.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click the icon in that row to view the job details and obtain the job ID.
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
- View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
- View the Launch Time and Duration of each task.
- Click Duration to sort tasks. Check whether the overall job duration is prolonged because one task has taken much longer than the others. According to Figure 2, when data skew occurs, one task reads far more shuffle data than the other tasks.
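The comparison in the last step can be expressed as a simple ratio. The following is a minimal sketch that divides the largest per-task shuffle read by the median. The function name skew_ratio, the sample values, and the threshold of 5 are hypothetical and are not taken from DLI or Spark. Copy the real values from the task table in the Spark UI.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median; close to 1 means balanced tasks."""
    mid = median(values)
    if mid == 0:
        raise ValueError("median is zero, so the ratio is undefined")
    return max(values) / mid


# Hypothetical shuffle read sizes in MB, one entry per task.
shuffle_read_mb = [120, 115, 130, 118, 4200]

SKEW_THRESHOLD = 5  # hypothetical cut-off, tune it for your workload

ratio = skew_ratio(shuffle_read_mb)
is_skewed = ratio > SKEW_THRESHOLD
print(f"max/median shuffle read ratio: {ratio:.1f}, skewed: {is_skewed}")
```

The median is used instead of the mean because a single oversized task pulls the mean up and hides the imbalance.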
Cause locating and solution
Shuffle data skew is caused by an unbalanced distribution of key values in a join.
- Run GROUP BY and COUNT on the join key to count the rows for each key value. The following is an example:
Table lefttbl is joined with table righttbl, and num in lefttbl is the join key. Run GROUP BY and COUNT on lefttbl.num to see which key values have the most rows.
SELECT * FROM lefttbl a LEFT JOIN righttbl b ON a.num = b.int2;
SELECT count(1) AS count, num FROM lefttbl GROUP BY lefttbl.num ORDER BY count DESC;
- Use concat together with cast(round(rand() * 999999999) as string) to attach a random number to each key value.
- If the skew is serious and random numbers cannot be generated, see How Do I Do When Data Skew Occurs During the Execution of a SQL Job?
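The effect of attaching a random number to the key can be illustrated outside SQL. The following is a plain Python sketch of the general salting idea, not the DLI implementation. The data, the constant SALTS, and the helper names are hypothetical. The side without skew is replicated once per salt value so that every salted key still finds its match.

```python
import random
from collections import Counter

SALTS = 10  # hypothetical number of distinct salt values


def salt_left(keys, rng):
    """Append a random salt to every key on the skewed side."""
    return [f"{key}_{rng.randrange(SALTS)}" for key in keys]


def expand_right(keys):
    """Replicate every key on the other side once per salt value."""
    return {f"{key}_{salt}" for key in keys for salt in range(SALTS)}


# Hypothetical data: key 7 dominates the left side of the join.
left_keys = [7] * 9000 + list(range(100)) * 10
right_keys = set(range(100))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
salted_left = salt_left(left_keys, rng)
salted_right = expand_right(right_keys)

before = Counter(left_keys).most_common(1)[0][1]
after = Counter(salted_left).most_common(1)[0][1]
still_matches = all(key in salted_right for key in salted_left)

print("largest key group before salting:", before)
print("largest key group after salting:", after)
print("every salted key still has a match:", still_matches)
```

The largest group shrinks because rows that shared one key are now split across up to SALTS salted keys. The cost is that the other side of the join grows by a factor of SALTS.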