SQL: finding gaps in a huge event stream?
I have about 1 million events in a PostgreSQL database, in the following format:
id | stream_id | timestamp
----------+-----------------+-----------------
1 | 7 | ....
2 | 8 | ....
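There are about 50,000 unique streams.
I need to find all events where the time between two consecutive events is longer than a certain period. In other words, I need to find pairs of events with no other event between them for that period.
For example: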
a  b  c  d  e  f                 g  h  i  j  k
|  |  |  |  |  |                 |  |  |  |  |
               \_____2 mins______/
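In this scenario, I want to find the pair (f, g), since those are the events immediately on either side of a gap.
I don't mind if the query is (fairly) slow: on 1 million records, taking an hour or so is fine. However, the data set will keep growing, so if the query is slow it should at least scale sanely.
I also have the data in MongoDB.
What is the best way to perform this query?
In Postgres this can be done very easily with the lag() window function. Take the following fiddle as an example:
PostgreSQL 9.3 Schema Setup: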
CREATE TABLE Table1
("id" int, "stream_id" int, "timestamp" timestamp)
;
INSERT INTO Table1
("id", "stream_id", "timestamp")
VALUES
(1, 7, '2015-06-01 15:20:30'),
(2, 7, '2015-06-01 15:20:31'),
(3, 7, '2015-06-01 15:20:32'),
(4, 7, '2015-06-01 15:25:30'),
(5, 7, '2015-06-01 15:25:31')
;
Query 1:
-- Ordering by id assumes ids increase with time within each stream;
-- if that does not hold, order by "timestamp" instead.
with c as (select *,
lag("timestamp") over(partition by stream_id order by id) as pre_time,
lag(id) over(partition by stream_id order by id) as pre_id
from Table1
)
select * from c where "timestamp" - pre_time > interval '2 sec';
Results:
| id | stream_id | timestamp | pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
| 4 | 7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 | 3 |
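Since the question expects the table to keep growing, one thing that may be worth trying is an index whose columns match the window definition, so the planner has the option of reading rows already sorted per stream. This is only a sketch: the index name is made up, it assumes the Table1 schema from the setup above, and whether PostgreSQL actually uses the index depends on the data and the server version, so check the plan with EXPLAIN rather than taking it on faith.

```sql
-- Hypothetical index name; the columns mirror
-- PARTITION BY stream_id ORDER BY id from the query above.
CREATE INDEX table1_stream_id_id_idx ON Table1 (stream_id, id);

-- Inspect the plan to see whether the window function still needs
-- an explicit sort step or reads from the index instead.
EXPLAIN
WITH c AS (
  SELECT *,
         lag("timestamp") OVER (PARTITION BY stream_id ORDER BY id) AS pre_time
  FROM Table1
)
SELECT * FROM c WHERE "timestamp" - pre_time > interval '2 sec';
```

If the window is ordered by "timestamp" rather than id, the index columns would presumably be (stream_id, "timestamp") instead.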
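You can do this with the lag() window function over a partition by stream_id, ordered by timestamp. The lag() function gives you access to earlier rows in the partition; with no offset argument, it returns the immediately preceding row. So if the partition on stream_id is ordered by time, the previous row is the previous event for that stream_id.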
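SELECT * FROM (
  SELECT stream_id, lag(id) OVER pair AS start_id, id AS end_id,
         ("timestamp" - lag("timestamp") OVER pair) AS diff
  FROM my_table
  WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps  -- window results cannot be filtered in the same query level, so filter in an outer query
WHERE diff > interval '2 minutes';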
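To see what lag() yields before any filtering, here is a small self-contained sketch. It inlines the five sample rows from the other answer as a CTE; the names events, ts, prev_id and diff are chosen for illustration only.

```sql
-- Sample rows inlined so the statement runs without any table.
WITH events(id, stream_id, ts) AS (
  VALUES (1, 7, timestamp '2015-06-01 15:20:30'),
         (2, 7, timestamp '2015-06-01 15:20:31'),
         (3, 7, timestamp '2015-06-01 15:20:32'),
         (4, 7, timestamp '2015-06-01 15:25:30'),
         (5, 7, timestamp '2015-06-01 15:25:31')
)
SELECT id, stream_id, ts,
       lag(id) OVER pair      AS prev_id,  -- NULL for the first event of a stream
       ts - lag(ts) OVER pair AS diff      -- NULL for the first event of a stream
FROM events
WINDOW pair AS (PARTITION BY stream_id ORDER BY ts)
ORDER BY stream_id, ts;
```

The first event of each stream has no predecessor, so prev_id and diff are NULL for it. A comparison such as diff > interval '2 minutes' is not true for NULL, so that row drops out of the gap query on its own without a special case. For this sample, the only large diff is on the row with id 4 (4 minutes 58 seconds after id 3).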
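Are the events defined by a single point in time (as opposed to a time span)? Do you want this per stream? In other words, are the gaps between events that belong to a single stream?
@MOehm Yes, a single point in time.
Do you need a solution for MongoDB as well? Could you show us sample documents from MongoDB and the expected output?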