Improving the performance of a SQL query that uses WITH in DB2


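The query shown below runs very slowly. My MY_TASK table already holds close to 4 million records.

Can we make any performance improvements to it?

Take the following table as an example.

Here I have used plain numbers for start_dt and end_dt instead of timestamp values.

Additional note: if end_dt is empty, the row is an active record that a worker is currently handling.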

T_ID |start_dt |end_dt |code       |p_id
-----|---------|-------|-----------|---
1    |8        |4      |INPROGRESS |110
1    |4        |       |ASSIGNED   |110
4    |10       |4      |INPROGRESS |110
4    |4        |       |ASSIGNED   |110
5    |4        |4      |INPROGRESS |110
6    |12       |12     |INPROGRESS |110
6    |8        |8      |ASSIGNED   |110
6    |8        |       |DONE       |110
2    |12       |12     |INPROGRESS |210
2    |8        |8      |ASSIGNED   |210
2    |8        |       |DONE       |210
3    |12       |12     |INPROGRESS |111

The output looks like this:

P_ID |avg_bgn_diff |assigned |in_progress |completed | comp_diff
-----|-------------|---------|------------|----------|----------
110  | 4           |   2     |    1       |     1    |      10
210  | null        |   0     |    0       |     1    |      8
111  | null        |   0     |    1       |     0    |      null
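
Explanation of the output: I have masked the original query with made-up names, so some table references may be broken. My apologies for that.

The MY_TASK table holds the unique task IDs.

The MY_PEOPLE table is the employee table.

The MY_TASK_REF table holds the details of who has which task.

A task has a status, and every status change adds a record to the task table. The statuses are ASSIGNED, INPROGRESS and DONE.

Whenever END_DT is absent, the record is the active one.

The first output field, avg_bgn_diff, is the average time of all 'ASSIGNED' tasks whose end time is null.

The assigned, in_progress and completed fields show how many active tasks each employee has in each category.

comp_diff is the average completion time per employee. An employee starts working on a task when its record enters INPROGRESS. We average over the tasks completed today, taking the difference between the start date of the INPROGRESS record and the start date of the DONE record.

I have the following query: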

WITH a AS (
    SELECT
        t1.t_id AS t_id,
        t1.start_dt AS start_dt,
        t1.end_dt AS end_dt,
        t1.code AS code,
        t2.p_id AS p_id
    FROM
        my_task t2
        INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
        INNER JOIN my_people p1 ON t2.p_id = p1.p_id
    WHERE
        -- ignore DONE tasks
        t1.t_id NOT IN (
            SELECT t.t_id
            FROM my_task t
            WHERE t.code = 'DONE' AND trunc(t.execution_dt) < trunc(current_timestamp)
        )
        and p1.department_id = '1234' 
    ORDER BY p_id DESC
) SELECT
    d.p_id,
    d.avg_bgn_diff
    ,e.assigned
    ,e.in_progress
    ,e.completed
    ,g.comp_diff
  FROM
    -- find average time for persons for diff ASSIGNMENT
    (
        SELECT c.p_id,AVG(c.bgn_diff) AS avg_bgn_diff
        FROM(
                SELECT b.p_id,timestampdiff(4,current_timestamp - a.start_dt) AS bgn_diff
                FROM ( SELECT p_id,t_id,start_dt FROM a WHERE end_dt IS NULL ) b
                LEFT OUTER JOIN  ( SELECT p_id, t_id,start_dt FROM a WHERE 
                     code = 'ASSIGNED' AND   end_dt IS NULL ) x ON x.p_id = b.p_id
            ) c  GROUP BY C.p_id
    ) d
    -- find count of each codes person has
    INNER JOIN (
        SELECT 
            p_id,
            SUM( CASE WHEN code = 'ASSIGNED' THEN 1 ELSE 0 END ) AS assigned,
            SUM( CASE WHEN code = 'INPROGRESS' THEN 1 ELSE 0 END ) AS in_progress,
            SUM( CASE WHEN code = 'DONE' AND trunc(start_dt) = trunc(current_timestamp)
                    THEN 1 ELSE 0 END ) AS completed
        FROM
            a where end_dt IS NULL
        GROUP BY p_id
    ) e on D.p_id=E.p_id 
    -- find total avg diff of entire task took to compelete.
    LEFT OUTER JOIN (
        SELECT F.p_id,AVG(f.bgn_diff) AS comp_diff
        FROM
            (
                SELECT a.p_id, timestampdiff(4,b.start_dt - a.start_dt) AS bgn_diff
                FROM (
                        SELECT p_id, t_id, start_dt FROM a WHERE code = 'INPROGRESS'
                    ) a
                    INNER JOIN (
                        SELECT p_id, t_id, start_dt FROM a
                        WHERE code = 'DONE' AND   trunc(start_dt) = trunc(current_timestamp)
                    ) b ON a.t_id = b.t_id
            ) f GROUP BY F.p_id
    ) g ON D.p_id=G.p_id
WITH
ur;

Can this be written in a different way that improves performance?

Note: indexes exist on all the necessary columns.

Thanks in advance.

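Try removing the ORDER BY p_id DESC from the first query. Sorting is usually very expensive, and the outer query does not depend on that ordering. Also, the NOT IN in the first query appears to look at the same base table, my_task, so I would suggest simply putting the filter in the WHERE clause: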

WITH a AS (
SELECT
    t1.t_id AS t_id,
    t1.start_dt AS start_dt,
    t1.end_dt AS end_dt,
    t1.code AS code,
    t2.p_id AS p_id
FROM
    my_task t2
    INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
    INNER JOIN my_people p1 ON t2.p_id = p1.p_id
WHERE
    -- ignore DONE tasks
    t2.code <> 'DONE' AND trunc(t2.execution_dt) < trunc(current_timestamp)
    and p1.department_id = '1234' )

The subquery that computes avg_bgn_diff could then become:

SELECT a.p_id,AVG(timestampdiff(4,current_timestamp - a.start_dt)) AS 
avg_bgn_diff
FROM a
WHERE end_dt IS NULL OR (code = 'ASSIGNED' AND end_dt IS NULL )
GROUP BY a.p_id

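If you provide an explain plan for the query, a list of the indexes, perhaps a better explanation of what you are trying to do, and a fix for the syntax error in the reference to table c, we can certainly do better. Even so, this version of the query may already be somewhat faster.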
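
To gather the explain plan and index list asked for above, something along these lines may help. It is only a sketch, assuming DB2 for Linux, UNIX and Windows with the explain tables already created; the statement being explained is a simplified stand-in for the real query, and the table names are the masked ones from the question.

```sql
-- Capture the access plan without running the statement.
-- Assumption: the explain tables already exist in the current schema.
EXPLAIN PLAN FOR
SELECT t2.p_id, COUNT(*) AS task_rows
FROM my_task t2
     INNER JOIN my_task_ref t1 ON t1.t_id = t2.t_id
     INNER JOIN my_people p1 ON p1.p_id = t2.p_id
WHERE p1.department_id = '1234'
GROUP BY t2.p_id;

-- List the indexes defined on the tables involved.
SELECT tabname, indname, uniquerule, colnames
FROM syscat.indexes
WHERE tabname IN ('MY_TASK', 'MY_TASK_REF', 'MY_PEOPLE')
ORDER BY tabname, indname;
```

The captured plan can then be formatted from the command line with db2exfmt, for example `db2exfmt -d <dbname> -1 -o plan.txt`, where `<dbname>` is a placeholder for your database name.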

Pay attention to the comments throughout:

WITH Filtered_Task AS (SELECT My_Task_Ref.t_id,
                              My_Task_Ref.start_dt, My_Task_Ref.end_dt,
                              My_Task_Ref.code,
                              Task_A.p_id
                       FROM My_Task AS Task_A
                       JOIN My_Task_Ref
                         ON My_Task_Ref.t_id = Task_A.t_id
                       JOIN My_People
                         ON My_People.p_id = My_Task_Ref.p_id
                            AND My_People.department_id = '1234'
                       -- NOT IN should be fine, I just default to NOT EXISTS
                       WHERE NOT EXISTS (SELECT 1
                                         FROM My_Task AS Task_B
                                         WHERE Task_B.t_id = Task_A.t_id
                                         AND Task_B.code = 'DONE'
                                         -- Calling a function on a column can
                                         -- cause indices to be ignored
                                         AND Task_B.execution_dt < TIMESTAMP(CURRENT_DATE)))

SELECT Average_Time_And_Code_Count.p_id,
       Average_Time_And_Code_Count.average_begin_difference,
       COALESCE(Average_Time_And_Code_Count.assigned, 0),
       COALESCE(Average_Time_And_Code_Count.in_progress, 0),
       COALESCE(Average_Time_And_Code_Count.completed, 0),
       Average_Complete_Time.average_complete_difference
FROM (SELECT p_id,
             -- The join you had previously was almost certainly duplicating 
             -- some rows, distorting the results.
             AVG(CASE WHEN code = 'ASSIGNED' 
                      -- TIMESTAMPDIFF works off an estimate, and will be wrong
                      -- if a task takes more than a month.
                      THEN TIMESTAMPDIFF(4, CURRENT_TIMESTAMP - start_dt) END) AS average_begin_difference,
             SUM(CASE WHEN code = 'ASSIGNED' 
                               THEN 1 END) AS assigned,
             SUM(CASE WHEN code = 'INPROGRESS' 
                               THEN 1 END) AS in_progress,
             SUM(CASE WHEN code = 'DONE' 
                                    AND start_dt >= TIMESTAMP(CURRENT_DATE) 
                               THEN 1 END) AS completed
      FROM Filtered_Task
      WHERE end_dt IS NULL
      GROUP BY p_id) AS Average_Time_And_Code_Count
-- I'm not convinced this measures what you think it does,
-- but I'm not sure what it is you think you _are_ measuring....
LEFT JOIN (SELECT InProgress.p_id,
                  AVG(TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt)) AS average_complete_difference
           FROM Filtered_Task AS InProgress
           JOIN Filtered_Task AS Done
             ON InProgress.t_id = Done.t_id
                AND Done.code = 'DONE'
                AND Done.start_dt >= TIMESTAMP(CURRENT_DATE)
           WHERE InProgress.code = 'INPROGRESS'
           GROUP BY InProgress.p_id) AS Average_Complete_Time
       ON Average_Complete_Time.p_id = Average_Time_And_Code_Count.p_id

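For starters, try replacing the NOT IN with a left join.

@DanielMarcus I have noted that idea. Any other changes?

Why are there so many nested queries? I feel a lot of them could be rewritten with left joins. For example, at the end you select from table "a" three times; you should be able to select once and use conditional logic where needed.

@DanielMarcus I thought about it for a while but could not come up with a complete query using left joins. Could you show me how to do it for the part that finds the average time per person for the different tasks?

Please explain how your data set matches your query results. I see a lot of inconsistencies.

Your suggestion 1 is not possible, because I want to remove all records of tasks that were completed before the current timestamp. Your suggestion removes only the DONE records, not all of them.

@JBaba If you want something taken into account, describe it in the question and do not make people guess. If you forgot to specify something, that is not the fault of the people trying to help you.

@Clockwork-Muse I have added a detailed explanation of the output.

"TIMESTAMPDIFF works off an estimate, and will be wrong if a task takes more than a month." What is the solution to this problem?

@JBaba - I'm going to redirect you. That one covers hours, but you should be able to convert it to minutes. Note that both timestamps must be in the same time zone, and it must be a time zone without DST, otherwise you will first have to do a lot of extra complicated math. Personally, I recommend creating a function to wrap up the math for you: TIMESTAMP_DIFF(unit, start, end).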
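
As a rough illustration of the function suggested in the last comment, the sketch below computes an exact minute difference from day counts and seconds since midnight, so it does not rely on the 30-day month estimate that TIMESTAMPDIFF uses. The name minutes_between is hypothetical (it is not a built-in), the unit is fixed to minutes rather than passed as a parameter, and the same caveat applies: both timestamps are assumed to be in the same time zone with no DST shift between them.

```sql
-- Hypothetical helper, not part of DB2: whole minutes from ts_start to ts_end.
-- DAYS() gives a day number and MIDNIGHT_SECONDS() the seconds since midnight,
-- so the difference is exact regardless of how long the interval is.
CREATE OR REPLACE FUNCTION minutes_between (ts_start TIMESTAMP, ts_end TIMESTAMP)
    RETURNS BIGINT
    LANGUAGE SQL
    DETERMINISTIC
    NO EXTERNAL ACTION
    CONTAINS SQL
    RETURN (BIGINT(DAYS(ts_end) - DAYS(ts_start)) * 86400
            + (MIDNIGHT_SECONDS(ts_end) - MIDNIGHT_SECONDS(ts_start))) / 60;

-- Example call: the two values are 60 days and 30 minutes apart.
VALUES minutes_between(TIMESTAMP('2024-01-01-00.00.00'),
                       TIMESTAMP('2024-03-01-00.30.00'));
```

In the queries above, an expression such as TIMESTAMPDIFF(4, Done.start_dt - InProgress.start_dt) could then be written as minutes_between(InProgress.start_dt, Done.start_dt), assuming the columns are real timestamps rather than the plain numbers used in the sample data.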