SQL: Using a window function to count, for each event row, the events that occurred within a given time interval before it
Tags: sql, postgresql, aggregate-functions, window-functions, postgresql-performance
I have a table that stores events happening to users. Sample data looks like this:
+-----------+----------+-------------+----------------------------+
| event_id | user_id | event_type | timestamp |
+-----------+----------+-------------+----------------------------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+
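To try the queries in this thread, the sample rows can be loaded with something like the following. This is only a minimal sketch: the column types are assumptions (the question does not show the table definition), and the dates are rewritten in ISO format.

```sql
-- Hypothetical setup: the column types are assumed, not taken from the question.
CREATE TABLE events (
    event_id    integer   PRIMARY KEY,
    user_id     integer   NOT NULL,
    event_type  integer   NOT NULL,
    "timestamp" timestamp NOT NULL
);

-- The seven sample rows from the table above.
INSERT INTO events (event_id, user_id, event_type, "timestamp") VALUES
    (1, 1, 1, '2015-01-01 00:00:00'),
    (2, 1, 1, '2015-01-10 00:00:00'),
    (3, 1, 1, '2015-01-20 00:00:00'),
    (4, 1, 1, '2015-01-30 00:00:00'),
    (5, 1, 1, '2015-02-10 00:00:00'),
    (6, 1, 1, '2015-02-21 00:00:00'),
    (7, 1, 1, '2015-02-22 00:00:00');
```

Some of the answers in this thread name the timestamp column differently (`ztimestamp`, `ts`), so the column would need to be renamed to run those queries unchanged.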
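For each event, I want the number of events of the same user and the same event type that occurred within the 30 days before it (the event itself included).
It should look like this: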
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 3 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 3 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 4 |
+-----------+----------+-------------+-----------------------------+-------+
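The table contains millions of rows, so I cannot use the correlated subquery suggested by @jpw in the answer below.
So far, I have managed to get the total number of earlier events with the same user_id and the same event_type using the following query: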
SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);
The result:
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 5 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 6 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 7 |
+-----------+----------+-------------+-----------------------------+-------+
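Do you know whether there is a way to change the window frame specification or the count function so that only the events that occurred within x days are counted?
Secondly, I would like to exclude duplicate events, i.e. events with the same event type and the same timestamp.

Perhaps you already know how to solve this with a subquery and are asking specifically for a solution that uses window functions; if so, this answer may not apply. But if you are interested in any possible solution, this is easy to solve with a correlated subquery, although I suspect performance may be poor: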
select
event_id, user_id,event_type,"timestamp",
(
select count(distinct timestamp)
from events
where timestamp >= e.timestamp - interval '30 days'
and timestamp <= e.timestamp
and user_id = e.user_id
and event_type = e.event_type
group by event_type, user_id
) as "count"
FROM events e
order by event_id;
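
This is clumsy, but it works. The CTE will probably perform worse than @jpw's correlated subquery with count. (The timestamp column is named ztimestamp here.)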
WITH ding AS (
SELECT user_id, event_type , ztimestamp
, row_number() OVER( PARTITION BY user_id, event_type
ORDER BY ztimestamp) AS rnk
FROM events
)
SELECT d1.*
, 1+ d1.rnk - d0.rnk AS diff
FROM ding d1
JOIN ding d0 USING (user_id,event_type)
WHERE d1.ztimestamp >= d0.ztimestamp
AND d1.ztimestamp < d0.ztimestamp + '30 days'::interval
AND NOT EXISTS (
SELECT *
FROM ding nx
WHERE nx.user_id = d0.user_id
AND nx.event_type = d0.event_type
AND nx.ztimestamp < d0.ztimestamp
AND nx.ztimestamp > d1.ztimestamp - '30 days'::interval
)
;

I found a query that works:
SELECT toto.event_id,toto.user_id,toto.event_type,toto.lv as time,COUNT(*)
FROM(
SELECT e.event_id, e.user_id,e.event_type,"timestamp",
last_value("timestamp") OVER w as lv,
unnest(array_agg(e."timestamp") OVER w) as agg
FROM events e
WINDOW w AS (PARTITION BY e.user_id,e.event_type ORDER BY e."timestamp"
ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP by event_id,user_id,event_type,lv;
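On my dev machine, with a 1000-row sample, it takes 49 ms to execute. For a 10000-row sample, with an index on the timestamp, it takes 8277 ms, while @jpw's query takes 6720 ms. For a 50000-row sample, both queries take more than 100 seconds, so I did not test them further.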

I provided a more detailed answer, plus a fiddle. Basically:
CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
And:
SELECT *
FROM events e
, LATERAL (
SELECT count(*) AS ct
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) ct
ORDER BY event_id;
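
For comparison, the window-frame approach the question asks about can be written directly on newer servers. The following is a sketch, not part of the answer above, and it assumes PostgreSQL 11 or later, where a RANGE frame accepts an interval offset. It reuses the `ts` column name from the index definition above.

```sql
-- Assumes PostgreSQL 11+ (RANGE frames with an offset) and a timestamp column named ts.
SELECT event_id, user_id, event_type, ts,
       count(*) OVER w AS ct
FROM   events
WINDOW w AS (
         PARTITION BY user_id, event_type
         ORDER BY ts
         -- Frame: rows whose ts lies in [current ts - 30 days, current ts].
         -- With RANGE, CURRENT ROW also includes peers that share the same ts.
         RANGE BETWEEN interval '30 days' PRECEDING AND CURRENT ROW
       )
ORDER  BY event_id;
```

The frame bounds match the `ts >= e.ts - interval '30 days' AND ts <= e.ts` conditions of the LATERAL query, and on the seven sample rows from the question this gives the counts 1, 2, 3, 4, 3, 3, 4. Rows with duplicate timestamps are counted separately here; PostgreSQL does not implement DISTINCT for window functions, so excluding duplicates would need a separate de-duplication step first.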

I found a similar question here:
Thanks for your answer. Yes, I should have been more precise in the question. The table contains millions of rows, so your answer, although correct, does not scale well with the volume.
+1. I suspected that might be the case, though. I'll give it some more thought.
Indeed, the execution plan does not look good: it takes 693 ms to execute on a generated 1000-row sample. @jpw's query takes 71 ms on the same sample. The one I am going to propose takes 49 ms.
Thanks for your answer, it is really fast. I will detail the benchmarks in an answer on dba.SE.