SQL: Retrieve unknown values based on a sequence of column values that is not tied to specific columns
I would like to return and manipulate time values based on their associated event values, but only when a specific sequence of events occurs. Below is a simplified example table:
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
| id | event1 | time1 | event2 | time2 | event3 | time3 | event4 | time4 | event5 | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent | 10:02 | fourthevent | 10:03 | fifthevent | 10:04 |
| abc123 | thirdevent | 10:10 | secondevent | 10:11 | thirdevent | 10:12 | firstevent | 10:13 | secondevent | 10:14 |
| def456 | thirdevent | 10:20 | firstevent | 10:21 | secondevent | 10:22 | thirdevent | 10:24 | fifthevent | 10:25 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+
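For this table, we want to retrieve the times at which this specific sequence of events occurs: firstevent, secondevent, thirdevent, followed by any non-null final event. That means the relevant entries returned would be the following: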
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
| id | event1 | time1 | event2 | time2 | event3 | time3 | event4 | time4 | event5 | time5 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
| abc123 | firstevent | 10:00 | secondevent | 10:01 | thirdevent | 10:02 | fourthevent | 10:03 | null | null |
| null | null | null | null | null | null | null | null | null | null | null |
| def456 | null | null | firstevent | 10:21 | secondevent | 10:22 | thirdevent | 10:24 | fifthevent | 10:26 |
+--------+------------+-------+-------------+-------+-------------+-------+-------------+-------+------------+-------+
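As shown above, the sequence is not tied to particular columns: the two results start in the event1 and event2 columns respectively, so the solution should be column-independent and support n columns. These values can then be aggregated by the last non-null event that occurs in the sequence after the 3 fixed values, giving a result like this: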
+-------------+-------------------------------+
| FinalEvent | AverageTimeBetweenFinalEvents |
+-------------+-------------------------------+
| fourthevent | 1:00 |
| fifthevent | 2:00 |
+-------------+-------------------------------+
Below is for BigQuery Standard SQL
#standardSQL
WITH search_events AS (
  SELECT ['firstevent', 'secondevent', 'thirdevent'] search
), temp AS (
  SELECT *, REGEXP_EXTRACT(events, CONCAT(search, r',(\w*)')) FinalEvent
  FROM (
    SELECT id, [time1, time2, time3, time4, time5] times,
      (SELECT STRING_AGG(event) FROM UNNEST([event1, event2, event3, event4, event5]) event) events,
      (SELECT STRING_AGG(search) FROM UNNEST(search) search) search
    FROM `project.dataset.table`, search_events
  )
)
SELECT FinalEvent,
  times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
FROM temp
WHERE IFNULL(FinalEvent, '') != ''
If applied to the sample data from your question, the result is
Row FinalEvent time
1 fourthevent 10:03
2 fifthevent 10:25
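So, as you can see, all final events are extracted along with their respective times. From here you can do whatever analysis you need. I am not sure about the logic behind the average time between events, so I am leaving that to you, especially since I think the main focus of the question was extracting those final events.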
Could you please provide the logic behind this statement?
times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
Sure, I hope the following helps explain the logic behind this expression:
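Assemble a regular expression that captures the list of events occurring before the matched sequence
Extract those events
Extract all commas from that list into an array
Calculate the position of the final event as the number of commas in that array + 3 (three being the number of positions in the search sequence)
Extract the respective time as the element of the times array at that position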
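To make the five steps above concrete, the sketch below splits the expression into one CTE per step and runs it on the question's sample rows, inlined as literals so no table is needed. The CTE names (sample, params, matched, step1 to step4) and the columns pattern, prefix, commas, final_offset and final_time are illustrative and not part of the original answer; the time values are assumed to be strings.

```sql
#standardSQL
-- Sample rows from the question, already flattened to a comma-separated
-- event string plus an array of times (assumed to be STRING values).
WITH sample AS (
  SELECT 'abc123' AS id,
         'firstevent,secondevent,thirdevent,fourthevent,fifthevent' AS events,
         ['10:00', '10:01', '10:02', '10:03', '10:04'] AS times
  UNION ALL
  SELECT 'abc123',
         'thirdevent,secondevent,thirdevent,firstevent,secondevent',
         ['10:10', '10:11', '10:12', '10:13', '10:14']
  UNION ALL
  SELECT 'def456',
         'thirdevent,firstevent,secondevent,thirdevent,fifthevent',
         ['10:20', '10:21', '10:22', '10:24', '10:25']
), params AS (
  SELECT 'firstevent,secondevent,thirdevent' AS search
), matched AS (
  -- The event right after the search sequence; NULL when the sequence is absent.
  SELECT *, REGEXP_EXTRACT(events, CONCAT(search, r',(\w*)')) AS FinalEvent
  FROM sample, params
), step1 AS (
  -- Step 1: regex whose capture group is everything before the sequence.
  SELECT *, CONCAT(r'(.*?)', search, ',', FinalEvent) AS pattern
  FROM matched
  WHERE IFNULL(FinalEvent, '') != ''
), step2 AS (
  -- Step 2: the events (with trailing comma) that precede the sequence.
  SELECT *, REGEXP_EXTRACT(events, pattern) AS prefix FROM step1
), step3 AS (
  -- Step 3: one array element per comma in that prefix.
  SELECT *, REGEXP_EXTRACT_ALL(prefix, ',') AS commas FROM step2
), step4 AS (
  -- Step 4: zero-based offset of the final event = preceding events + 3.
  SELECT *, ARRAY_LENGTH(commas) + 3 AS final_offset FROM step3
)
-- Step 5: pick the matching element of the times array.
SELECT id, FinalEvent, prefix, final_offset,
       times[SAFE_OFFSET(final_offset)] AS final_time
FROM step4
```

For the def456 row the prefix is "thirdevent," (one comma), so the offset is 1 + 3 = 4 and times[SAFE_OFFSET(4)] is the fifth element, 10:25, which matches the result shown earlier.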
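What is your query so far?
@mtr.web Honestly, I didn't really know where to start. I was hoping to find something like the LEAD function, but working across columns rather than rows, though I can't seem to find anything like that.
This is great. What I was hoping to achieve with AverageTimeBetweenFinalEvents is to subtract the timestamp of the second-to-last event from the timestamp of the final event to get a numeric value, and then take the average of that value across all FinalEvent instances. If it's not too much trouble, could you please provide the logic behind this statement? times[SAFE_OFFSET(ARRAY_LENGTH(REGEXP_EXTRACT_ALL(REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent )), ',')) + 3)] time
Sure, see the addition in my answer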
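The averaging described in the comment above (final event time minus the time of the event just before it, averaged per FinalEvent) could be sketched as follows. This is not part of the original answer: it assumes the time columns are strings in HH:MM form, which is why PARSE_TIME is used, and the names located, final_offset and avg_minutes_before_final_event are hypothetical. If the columns are already TIME values, the PARSE_TIME calls can be dropped.

```sql
#standardSQL
WITH search_events AS (
  SELECT ['firstevent', 'secondevent', 'thirdevent'] search
), temp AS (
  SELECT *, REGEXP_EXTRACT(events, CONCAT(search, r',(\w*)')) FinalEvent
  FROM (
    SELECT id, [time1, time2, time3, time4, time5] times,
      (SELECT STRING_AGG(event) FROM UNNEST([event1, event2, event3, event4, event5]) event) events,
      (SELECT STRING_AGG(search) FROM UNNEST(search) search) search
    FROM `project.dataset.table`, search_events
  )
), located AS (
  -- Same offset calculation as in the answer, kept as its own column
  -- so that the preceding element (offset - 1) can be read as well.
  SELECT FinalEvent, times,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(
      REGEXP_EXTRACT(events, CONCAT(r'(.*?)', search, ',', FinalEvent)), ',')) + 3 AS final_offset
  FROM temp
  WHERE IFNULL(FinalEvent, '') != ''
)
SELECT FinalEvent,
  -- Minutes between the last event of the search sequence and the final event.
  AVG(TIME_DIFF(
        PARSE_TIME('%H:%M', times[SAFE_OFFSET(final_offset)]),
        PARSE_TIME('%H:%M', times[SAFE_OFFSET(final_offset - 1)]),
        MINUTE)) AS avg_minutes_before_final_event
FROM located
GROUP BY FinalEvent
```

The result is a number of minutes per FinalEvent; formatting it as 1:00 or 2:00, as in the question's expected output, would be a separate step.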