PostgreSQL - fetch the row which has the max value for a column
I'm working with a Postgres table (called "lives") that holds records with time_stamp, usr_id, transaction_id, and lives_remaining columns. I need a query that will give me the most recent lives_remaining total for each usr_id:

- There are multiple users (distinct usr_id values).
- time_stamp is not a unique identifier: user events (rows in the table) will sometimes occur with the same time_stamp.
- trans_id is unique only over a very small time range: it repeats over time.
- lives_remaining (for a given user) can both increase and decrease over time.

For example:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp || '*' || trans_id)
AS max_timestamp_transid
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
ORDER BY b.usr_id
Well, this works, but I don't like it. It requires a query within a query and a self join, and it seems to me it could be much simpler by grabbing the row that MAX found to have the largest time_stamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMs and to Postgres in particular, so I know I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.

I found a similar discussion. Is there some Postgres equivalent of an Oracle analytic function that I could use?

Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and writing better queries would be much appreciated!

Note that you can create my example case with:
create TABLE lives (time_stamp timestamp, lives_remaining integer,
usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);
Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.
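As a concrete sketch of that suggestion (the index name is just an example):

```sql
CREATE INDEX lives_usr_ts_trans_idx
    ON lives (usr_id, time_stamp, trans_id);
```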
You should always have some kind of primary key in your table.

Here's another approach that happens to use no correlated subqueries and no GROUP BY. I'm not an expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which one works better for you:
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
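As an aside not in the original answer: the same "no later row exists for this user" condition can also be expressed as a NOT EXISTS anti-join, which PostgreSQL usually plans as an anti join as well; it may be worth comparing both plans (a sketch):

```sql
SELECT l1.*
FROM lives l1
WHERE NOT EXISTS (
    SELECT 1
    FROM lives l2
    WHERE l2.usr_id = l1.usr_id
      AND (l2.time_stamp > l1.time_stamp
           OR (l2.time_stamp = l1.time_stamp AND l2.trans_id > l1.trans_id))
)
ORDER BY l1.usr_id;
```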
I assume trans_id is unique, at least over any given value of time_stamp. I tested on a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30).

By query cost below, I am referring to the cost estimate of Postgres's cost-based optimizer (with Postgres's default xxx_cost values), which is a weighted-function estimate of the required I/O and CPU resources; you can obtain this by firing up pgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze".
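The same numbers can be obtained without pgAdminIII from psql: EXPLAIN prints the optimizer's cost estimate, and EXPLAIN ANALYZE also executes the query and reports actual timings (the query shown here is just a placeholder):

```sql
EXPLAIN ANALYZE
SELECT DISTINCT ON (usr_id) time_stamp, lives_remaining, usr_id, trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
```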
- Quassnoy's query has a cost estimate of 745k (!) and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
- Bill's query has a cost estimate of 93k and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
- Query #1 below has a cost estimate of 16k and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
- Query #2 below has a cost estimate of 14k and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id)); this one is Postgres-specific
- Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase (if needed) the sort memory setting to accommodate the sort in memory, it will be by far the fastest of all queries

All of the times above include retrieval of the full 10k-row result set.
Your goal is the lowest cost estimate and the lowest query execution time, with an emphasis on estimated cost. Query execution is significantly dependent on runtime conditions (e.g. whether the relevant rows are already fully cached in memory or not), whereas the cost estimate is not. On the other hand, keep in mind that a cost estimate is exactly that, an estimate.

The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC). Query time will vary in production depending on actual machine load and data access spread. When one query comes out slightly faster than the other yet has a much higher cost, it will generally be wiser to choose the one with the longer execution time but lower cost.

I think there is one major problem here: there is no monotonically increasing "counter" to guarantee that a given row happened later in time than another. Take this example:
timestamp lives_remaining user_id trans_id
10:00 4 3 5
10:00 5 3 6
10:00 3 3 1
10:00 2 3 2
You couldn't determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you could apply to any of this data to give you the correct answer.

Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.

Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.

I like this approach:
SELECT l.*
FROM (
SELECT DISTINCT usr_id
FROM lives
) lo, lives l
WHERE l.ctid = (
SELECT ctid
FROM lives li
WHERE li.usr_id = lo.usr_id
ORDER BY
time_stamp DESC, trans_id DESC
LIMIT 1
)
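As a sketch of the non-rolling trans_id suggestion above (the sequence name is hypothetical), a bigint-backed sequence is monotonically increasing and will not wrap in any realistic timeframe:

```sql
-- a bigint-backed sequence effectively never rolls over
CREATE SEQUENCE lives_trans_id_seq;
ALTER TABLE lives
    ALTER COLUMN trans_id TYPE bigint,
    ALTER COLUMN trans_id SET DEFAULT nextval('lives_trans_id_seq');
```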
              cost | time (dedicated machine) |     time (under load) |
-------------------+--------------------------+-----------------------+
some query A:   5k | (all data cached)  900ms | (less i/o)     1000ms |
some query B:  50k | (all data cached)  900ms | (lots of i/o) 10000ms |
-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(time_stamp) AS time_stamp_max
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
l1.time_stamp = l2.time_stamp_max
INNER JOIN (
SELECT
usr_id,
time_stamp,
MAX(trans_id) AS trans_max
FROM
lives
GROUP BY
usr_id, time_stamp
) AS l3
ON
l1.usr_id = l3.usr_id AND
l1.time_stamp = l3.time_stamp AND
l1.trans_id = l3.trans_max
-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
AS compound_time_stamp
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
l1.trans_id = l2.compound_time_stamp[2]
-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
last_value(time_stamp) OVER wnd,
last_value(lives_remaining) OVER wnd,
usr_id,
last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
PARTITION BY usr_id ORDER BY time_stamp, trans_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
SELECT *
FROM lives o
WHERE (usr_id, time_stamp, trans_id) IN (
    SELECT usr_id, time_stamp, trans_id
    FROM lives li
    WHERE li.usr_id = o.usr_id
    ORDER BY time_stamp DESC, trans_id DESC
    LIMIT 1
)
SELECT (array_agg(tree.id ORDER BY tree_size.size))[1]
FROM tree JOIN forest ON (tree.forest = forest.id)
GROUP BY forest.id
SELECT DISTINCT ON (usr_id)
time_stamp,
lives_remaining,
usr_id,
trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
SELECT t.*
FROM
   (SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY usr_id
                           ORDER BY time_stamp DESC, trans_id DESC) AS r
    FROM lives) AS t
WHERE t.r = 1