PostgreSQL - fetch the row which has the max value for a column

Tags: sql, postgresql, query-optimization, cbo, cost-based-optimizer

I'm dealing with a Postgres table (called "lives") containing records with time_stamp, usr_id, transaction_id, and lives_remaining columns. I need a query that will give me the most recent lives_remaining total for each usr_id.

  • There are multiple users (distinct usr_id's)
  • time_stamp is not a unique identifier: sometimes user events (one per row in the table) will occur with the same time_stamp.
  • trans_id is unique only over a very small time range: it repeats over time
  • lives_remaining (for a given user) can both increase and decrease over time
  • Example:

    time_stamp | lives_remaining | usr_id | trans_id
    -----------+-----------------+--------+---------
    07:00      | 1               | 1      | 1
    09:00      | 4               | 2      | 2
    10:00      | 2               | 3      | 3
    10:00      | 1               | 2      | 4
    11:00      | 4               | 1      | 5
    11:00      | 3               | 1      | 6
    13:00      | 3               | 3      | 1

Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I then also need to pass that information from the subquery to the main query, which supplies the data for the other columns of the matching rows. This is the hacked-up query that I've got working:

    SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM 
          (SELECT usr_id, max(time_stamp || '*' || trans_id) 
           AS max_timestamp_transid
           FROM lives GROUP BY usr_id ORDER BY usr_id) a 
    JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id 
    ORDER BY b.usr_id
    
Okay, so this works, but I don't like it. It requires a query within a query and a self join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest time_stamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMs and Postgres in particular, so I know I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.

I found a similar discussion. Can I perform some type of Postgres equivalent to an Oracle analytic function?

Any advice on accessing related column information used by an aggregate function (like MAX), on creating indexes, and on writing better queries would be much appreciated!

Note that you can use the following to create my test case:

    create TABLE lives (time_stamp timestamp, lives_remaining integer, 
                        usr_id integer, trans_id integer);
    insert into lives values ('2000-01-01 07:00', 1, 1, 1);
    insert into lives values ('2000-01-01 09:00', 4, 2, 2);
    insert into lives values ('2000-01-01 10:00', 2, 3, 3);
    insert into lives values ('2000-01-01 10:00', 1, 2, 4);
    insert into lives values ('2000-01-01 11:00', 4, 1, 5);
    insert into lives values ('2000-01-01 11:00', 3, 1, 6);
    insert into lives values ('2000-01-01 13:00', 3, 3, 1);
    
Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.
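As a sketch of the suggestion above (the index name is my own choice, not from the thread), here is the compound index created against an in-memory SQLite copy of the table, so it can be run without a Postgres instance; the CREATE INDEX statement itself is the same in Postgres:

```python
import sqlite3

# Build an in-memory copy of the "lives" table and the suggested index.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lives (time_stamp timestamp,
                lives_remaining integer, usr_id integer, trans_id integer)""")
conn.execute("""CREATE INDEX lives_usr_ts_trans_idx
                ON lives (usr_id, time_stamp, trans_id)""")

# Confirm the index was registered in the schema catalog.
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(names)  # ['lives_usr_ts_trans_idx']
```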


You should always have some kind of primary key in your tables.

Here's another approach, which happens to use no correlated subqueries or GROUP BY. I'm not expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.

    SELECT l1.*
    FROM lives l1 LEFT OUTER JOIN lives l2
      ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp 
       OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
    WHERE l2.usr_id IS NULL
    ORDER BY l1.usr_id;
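As a quick sanity check of the anti-join above (a sketch, run in SQLite rather than Postgres since the query is plain standard SQL), against the question's sample data it returns exactly one row per user:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lives (time_stamp timestamp,
                lives_remaining integer, usr_id integer, trans_id integer)""")
conn.executemany("INSERT INTO lives VALUES (?, ?, ?, ?)",
    [('2000-01-01 07:00', 1, 1, 1), ('2000-01-01 09:00', 4, 2, 2),
     ('2000-01-01 10:00', 2, 3, 3), ('2000-01-01 10:00', 1, 2, 4),
     ('2000-01-01 11:00', 4, 1, 5), ('2000-01-01 11:00', 3, 1, 6),
     ('2000-01-01 13:00', 3, 3, 1)])

# Keep l1 only when no l2 row for the same user is "later" under
# (time_stamp, trans_id) ordering.
result = conn.execute("""
    SELECT l1.usr_id, l1.lives_remaining, l1.trans_id
    FROM lives l1 LEFT OUTER JOIN lives l2
      ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
       OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
    WHERE l2.usr_id IS NULL
    ORDER BY l1.usr_id""").fetchall()
print(result)  # [(1, 3, 6), (2, 1, 4), (3, 3, 1)]
```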
    

I assume trans_id is unique, at least over any given value of time_stamp. I tested on a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30).

By query cost below, I am referring to the cost estimate of Postgres' cost-based optimizer (with Postgres' default xxx_cost values), which is a weighed-function estimate of the I/O and CPU resources required; you can obtain this by firing up pgAdminIII and running Query/Explain (F7) on the query, with the Query/Explain options set to Analyze.

  • Quassnoi's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
  • Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
  • Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
  • Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
    • this is Postgres-specific
  • Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once, and, should you temporarily increase work_mem (if needed) to accommodate the sort in memory, it will be by far the fastest of all the queries

All of the times above include retrieval of the full 10k-row result set.

Your goal is a minimal cost estimate and a minimal query execution time, with an emphasis on the estimated cost. Query execution can depend significantly on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that the cost estimate is exactly that, an estimate.


The best query execution times are obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC). Query times will vary in production depending on actual machine load and data-access spread. When one query comes out slightly faster than another but has a much higher cost, it will generally be wiser to pick the one with the lower cost.

I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Example:

    timestamp   lives_remaining   user_id   trans_id
    10:00       4                 3         5
    10:00       5                 3         6
    10:00       3                 3         1
    10:00       2                 3         2
    
You can't determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.

Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.

Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1, unless you do some complicated math.
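The ambiguity is easy to reproduce. In this sketch (SQLite stand-in for Postgres), the four events above share one timestamp, so joining back on the per-user MAX(time_stamp) matches every row, and the "latest" lives_remaining for user 3 is undecidable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lives (time_stamp text, lives_remaining integer,
                usr_id integer, trans_id integer)""")
conn.executemany("INSERT INTO lives VALUES (?, ?, ?, ?)",
    [('10:00', 4, 3, 5), ('10:00', 5, 3, 6),
     ('10:00', 3, 3, 1), ('10:00', 2, 3, 2)])

# Join back on the per-user maximum timestamp: every row ties.
ties = conn.execute("""
    SELECT l.lives_remaining
    FROM lives l
    JOIN (SELECT usr_id, MAX(time_stamp) AS ts
          FROM lives GROUP BY usr_id) m
      ON l.usr_id = m.usr_id AND l.time_stamp = m.ts""").fetchall()
print(len(ties))  # 4 -- all four rows tie for "most recent"
```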

I like this one:
    SELECT  l.*
    FROM    (
            SELECT DISTINCT usr_id
            FROM   lives
            ) lo, lives l
    WHERE   l.ctid = (
            SELECT ctid
            FROM   lives li
            WHERE  li.usr_id = lo.usr_id
            ORDER BY
              time_stamp DESC, trans_id DESC
            LIMIT 1
            )
    
    
                  cost | time (dedicated machine) |     time (under load) |
    -------------------+--------------------------+-----------------------+
    some query A:   5k | (all data cached)  900ms | (less i/o)     1000ms |
    some query B:  50k | (all data cached)  900ms | (lots of i/o) 10000ms |
    
    -- incrementally narrow down the result set via inner joins
    --  the CBO may elect to perform one full index scan combined
    --  with cascading index lookups, or as hash aggregates terminated
    --  by one nested index lookup into lives - on my machine
    --  the latter query plan was selected given my memory settings and
    --  histogram
    SELECT
      l1.*
     FROM
      lives AS l1
     INNER JOIN (
        SELECT
          usr_id,
          MAX(time_stamp) AS time_stamp_max
         FROM
          lives
         GROUP BY
          usr_id
      ) AS l2
     ON
      l1.usr_id     = l2.usr_id AND
      l1.time_stamp = l2.time_stamp_max
     INNER JOIN (
        SELECT
          usr_id,
          time_stamp,
          MAX(trans_id) AS trans_max
         FROM
          lives
         GROUP BY
          usr_id, time_stamp
      ) AS l3
     ON
      l1.usr_id     = l3.usr_id AND
      l1.time_stamp = l3.time_stamp AND
      l1.trans_id   = l3.trans_max
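Query #1 is standard SQL, so as a quick sanity check it can be exercised outside Postgres too; here's a sketch running it (projected down to two columns) over the question's sample data in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lives (time_stamp timestamp,
                lives_remaining integer, usr_id integer, trans_id integer)""")
conn.executemany("INSERT INTO lives VALUES (?, ?, ?, ?)",
    [('2000-01-01 07:00', 1, 1, 1), ('2000-01-01 09:00', 4, 2, 2),
     ('2000-01-01 10:00', 2, 3, 3), ('2000-01-01 10:00', 1, 2, 4),
     ('2000-01-01 11:00', 4, 1, 5), ('2000-01-01 11:00', 3, 1, 6),
     ('2000-01-01 13:00', 3, 3, 1)])

# Narrow down by max time_stamp per user, then by max trans_id per
# (user, time_stamp) to break ties -- same shape as query #1 above.
result = conn.execute("""
    SELECT l1.usr_id, l1.lives_remaining
    FROM lives AS l1
    INNER JOIN (SELECT usr_id, MAX(time_stamp) AS time_stamp_max
                FROM lives GROUP BY usr_id) AS l2
      ON l1.usr_id = l2.usr_id AND l1.time_stamp = l2.time_stamp_max
    INNER JOIN (SELECT usr_id, time_stamp, MAX(trans_id) AS trans_max
                FROM lives GROUP BY usr_id, time_stamp) AS l3
      ON l1.usr_id = l3.usr_id AND l1.time_stamp = l3.time_stamp
         AND l1.trans_id = l3.trans_max
    ORDER BY l1.usr_id""").fetchall()
print(result)  # [(1, 3), (2, 1), (3, 3)]
```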
    
    -- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
    -- this results in a single table scan and one nested index lookup into lives,
    --  by far the least I/O intensive operation even in case of great scarcity
    --  of memory (least reliant on cache for the best performance)
    SELECT
      l1.*
     FROM
      lives AS l1
     INNER JOIN (
       SELECT
         usr_id,
         MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
           AS compound_time_stamp
        FROM
         lives
        GROUP BY
         usr_id
      ) AS l2
    ON
      l1.usr_id = l2.usr_id AND
      EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
      l1.trans_id = l2.compound_time_stamp[2]
    
    -- use Window Functions
    -- performs a SINGLE scan of the table
    SELECT DISTINCT ON (usr_id)
      last_value(time_stamp) OVER wnd,
      last_value(lives_remaining) OVER wnd,
      usr_id,
      last_value(trans_id) OVER wnd
     FROM lives
     WINDOW wnd AS (
       PARTITION BY usr_id ORDER BY time_stamp, trans_id
       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     );
    
    
    -- note: "outer" is a reserved word in PostgreSQL, so the alias is
    --  renamed to "o"; the subquery must also sort by time_stamp (then
    --  trans_id) descending so that LIMIT 1 picks the most recent row
    SELECT *
    FROM lives o
    WHERE (usr_id, time_stamp, trans_id) IN (
        SELECT usr_id, time_stamp, trans_id
        FROM lives sq
        WHERE sq.usr_id = o.usr_id
        ORDER BY time_stamp DESC, trans_id DESC
        LIMIT 1
    )
    
    -- fixed: unbalanced parentheses; assumes each tree row carries its
    --  own size column
    SELECT (array_agg(tree.id ORDER BY tree.size))[1]
    FROM tree JOIN forest ON (tree.forest = forest.id)
    GROUP BY forest.id
    
    SELECT DISTINCT ON (usr_id)
        time_stamp,
        lives_remaining,
        usr_id,
        trans_id
    FROM lives
    ORDER BY usr_id, time_stamp DESC, trans_id DESC;
    
    SELECT DISTINCT ON (location) location, time, report
        FROM weather_reports
        ORDER BY location, time DESC;
    
    SELECT t.*
    FROM
        (SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY usr_id
                              ORDER BY time_stamp DESC, trans_id DESC) AS r
        FROM lives) AS t
    WHERE t.r = 1
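Run against the question's sample data (a sketch using SQLite, which also supports window functions), the ROW_NUMBER() approach with time_stamp ties broken by trans_id picks out the expected rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE lives (time_stamp timestamp,
                lives_remaining integer, usr_id integer, trans_id integer)""")
conn.executemany("INSERT INTO lives VALUES (?, ?, ?, ?)",
    [('2000-01-01 07:00', 1, 1, 1), ('2000-01-01 09:00', 4, 2, 2),
     ('2000-01-01 10:00', 2, 3, 3), ('2000-01-01 10:00', 1, 2, 4),
     ('2000-01-01 11:00', 4, 1, 5), ('2000-01-01 11:00', 3, 1, 6),
     ('2000-01-01 13:00', 3, 3, 1)])

# Rank rows per user, newest first; keep only rank 1.
result = conn.execute("""
    SELECT t.usr_id, t.lives_remaining, t.time_stamp
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY usr_id
                    ORDER BY time_stamp DESC, trans_id DESC) AS r
          FROM lives) AS t
    WHERE t.r = 1
    ORDER BY t.usr_id""").fetchall()
print(result)
# [(1, 3, '2000-01-01 11:00'), (2, 1, '2000-01-01 10:00'),
#  (3, 3, '2000-01-01 13:00')]
```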