Merging overlapping time intervals in SQL based on a hierarchy
I am trying to solve a problem where I want to merge overlapping intervals for a given id column, but I also want to merge them according to a hierarchy/priority. I have a start time and a stop time for each interval, and each interval has a hierarchy/priority associated with it. The table has the following columns:
id, start_time, stop_time, priority
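I was able to solve the version of the problem that ignores priority, but I am struggling with this one.
Red colour: p1 (priority 1)
Blue colour: p2 (priority 2)
Green colour: p3 (priority 3)
Note that in the sample input below there are 9 rows with the same id, and the output has 6 rows. Also note that some ids may have only some of the priority values, or only a single one, and the solution should handle that.
Expected input and output: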
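This is a tricky problem. One solution is to find where each island starts and take a cumulative count of those starts. You can identify a start by checking that no earlier interval overlaps the current one: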
select id, priority, min(start_time) as start_time, max(stop_time) as stop_time
from (select t.*,
             -- a new island starts when no earlier interval reaches this start_time
             countif(prev_stop_time < start_time) over (partition by id, priority order by start_time) as grp
      from (select t.*,
                   -- latest stop_time among all earlier rows of the same id and priority
                   max(stop_time) over (partition by id, priority order by start_time rows between unbounded preceding and 1 preceding) as prev_stop_time
            from t
           ) t
     ) t
group by id, priority, grp;
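To make the island-start test easier to see in isolation, here is a minimal sketch that only computes the flag, without grouping. The three rows are made up for illustration (they are not the data from the question), and the names `flagged` and `is_island_start` are my own.

```sql
#standardSQL
-- Hypothetical sample rows; the name t mirrors the query above.
WITH t AS (
  SELECT 1 AS id, 1 AS priority, 0 AS start_time, 10 AS stop_time UNION ALL
  SELECT 1, 1, 5, 15 UNION ALL   -- overlaps the first row
  SELECT 1, 1, 20, 25            -- starts after everything seen so far
), flagged AS (
  SELECT
    t.*,
    -- Latest stop_time among all earlier rows of the same id and priority.
    MAX(stop_time) OVER (
      PARTITION BY id, priority
      ORDER BY start_time
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS prev_stop_time
  FROM t
)
SELECT
  id, priority, start_time, stop_time, prev_stop_time,
  -- TRUE when nothing earlier reaches this row, i.e. a new island begins.
  (prev_stop_time IS NULL OR prev_stop_time < start_time) AS is_island_start
FROM flagged
ORDER BY id, priority, start_time;
```

With these rows the flag should be TRUE for the intervals starting at 0 and 20 and FALSE for the one starting at 5, so a running count of the flags gives the first two rows one group and the third row another. Note that this groups within each priority separately; it does not by itself let a higher priority cut into a lower one.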
Below is a solution for BigQuery Standard SQL:
#standardSQL
WITH check_times AS (
  -- every start and stop is a potential boundary
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  -- consecutive boundaries form elementary, non-overlapping intervals
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  -- keep elementary intervals covered by at least one original row,
  -- labelled with the highest priority (lowest number) covering them
  SELECT a.id, a.start_time, a.stop_time, MIN(priority) priority
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
    ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- glue adjacent elementary intervals that share the same priority
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, ANY_VALUE(priority) priority
  FROM (
    SELECT id, start_time, stop_time, priority, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, priority,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) OR
        priority != IFNULL(LAG(priority) OVER(PARTITION BY id ORDER BY start_time), -1) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
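The first three steps of that query (collect boundaries, build elementary intervals, label each with its winning priority) can be tried on their own. The sketch below is only an illustration under my own assumptions: the two rows in `sample_rows` are invented, not the data from the question, and priority 1 is taken to outrank priority 2, as in the legend of the question.

```sql
#standardSQL
WITH sample_rows AS (
  -- Hypothetical rows: a priority-2 interval partly covered by a priority-1 one.
  SELECT 1 AS id, 0 AS start_time, 10 AS stop_time, 2 AS priority UNION ALL
  SELECT 1, 5, 15, 1
), check_times AS (
  -- All distinct boundaries per id: 0, 5, 10, 15.
  SELECT id, start_time AS time FROM sample_rows UNION DISTINCT
  SELECT id, stop_time AS time FROM sample_rows
), distinct_intervals AS (
  -- Consecutive boundaries give elementary intervals; the last one has a NULL stop.
  SELECT id, time AS start_time,
    LEAD(time) OVER (PARTITION BY id ORDER BY time) AS stop_time
  FROM check_times
)
SELECT a.id, a.start_time, a.stop_time, MIN(b.priority) AS priority
FROM distinct_intervals a
JOIN sample_rows b
  ON a.id = b.id
  -- Keep an elementary interval only if some original row fully contains it.
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
GROUP BY a.id, a.start_time, a.stop_time
ORDER BY a.id, a.start_time;
```

For these two rows this should return the pieces 0 to 5 with priority 2, then 5 to 10 and 10 to 15 with priority 1. The trailing piece with a NULL stop_time drops out because BETWEEN evaluates to NULL for it. The final step of the full query would then glue the two priority-1 pieces into a single 5 to 15 row.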
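If applied to the sample data from your question, the result is:
"Could you also share a solution where we merge intervals based only on id, without the priority column?"
I just slightly adjusted the query above to ignore priority: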
#standardSQL
WITH check_times AS (
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
    ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time
  FROM (
    SELECT id, start_time, stop_time, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
Result:
Row id start_time stop_time
1 1 0 36
2 1 41 47
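Comments:
MySQL or BigQuery? They are completely different.
A typical gaps-and-islands problem.
@GordonLinoff BigQuery. Added some details.
@GordonLinoff Also, please note that I need to combine these for a given id.
I modified my answer to take that into account.
Thanks a lot @mikhail, it works great. I will test it on more sample data to verify the solution.
Sure. Remember, the answer is based on the logic and sample data you presented in the question. So if you run into a case that is not covered, please post a new question. Also consider voting, and if you confirm that it works, come back to accept :)
Could you also share a solution where we merge intervals based on id only, without the priority column? I would like to see how it performs against the solution I came up with. I want to compare run times, because the data I am dealing with is very large. Thanks.
Added this to my answer as the second solution.
Hi @Mikhail, regarding the merge that ignores priority: is it possible to modify the solution so that we get one more column with the priority for that time period, i.e. the maximum value of priority? So we would have an additional priority column, for example with a value of 3 for the 10 to 36 duration, and a value for row 2 with the duration 41 to 47.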
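The last comment has no reply in this thread. One possible way to get such a column, offered only as a sketch and not as the answerer's method, is to carry `MAX(priority)` through the priority-agnostic query: once per elementary interval, and again when the pieces are merged. The rows in `sample_rows` and the CTE names `flagged` and `grouped` are hypothetical.

```sql
#standardSQL
WITH sample_rows AS (
  -- Hypothetical rows, not the data from the question.
  SELECT 1 AS id, 0 AS start_time, 10 AS stop_time, 1 AS priority UNION ALL
  SELECT 1, 5, 15, 3 UNION ALL
  SELECT 1, 20, 25, 2
), check_times AS (
  SELECT id, start_time AS time FROM sample_rows UNION DISTINCT
  SELECT id, stop_time AS time FROM sample_rows
), distinct_intervals AS (
  SELECT id, time AS start_time,
    LEAD(time) OVER (PARTITION BY id ORDER BY time) AS stop_time
  FROM check_times
), deduped_intervals AS (
  -- Largest priority value among the original rows covering each piece.
  SELECT a.id, a.start_time, a.stop_time, MAX(b.priority) AS priority
  FROM distinct_intervals a
  JOIN sample_rows b
    ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), flagged AS (
  -- A piece starts a new island when it does not touch the previous piece.
  SELECT *,
    start_time != IFNULL(LAG(stop_time) OVER (PARTITION BY id ORDER BY start_time), start_time) AS flag
  FROM deduped_intervals
), grouped AS (
  SELECT *, COUNTIF(flag) OVER (PARTITION BY id ORDER BY start_time) AS grp
  FROM flagged
)
SELECT id,
  MIN(start_time) AS start_time,
  MAX(stop_time) AS stop_time,
  MAX(priority) AS priority  -- largest priority value seen anywhere in the island
FROM grouped
GROUP BY id, grp
ORDER BY id, start_time;
```

For the three made-up rows this should produce two islands: 0 to 15 with priority 3, and 20 to 25 with priority 2. If "maximum" is meant as the highest rank rather than the largest number, swapping both `MAX(priority)` calls for `MIN(priority)` would be the corresponding change.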