Google BigQuery: query to find records created one after another
I'm playing around with BigQuery. Given the following input:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
| 3 | 2 | 309 | NY | 2019-02-12 06:41pm |
| 1 | 1 | 654 | LA | 2019-02-12 05:12pm |
+---------------+---------+---------+--------+----------------------+
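I want to find transactions that were made by the same agent one after another within 5 minutes. So the output for the table above should look like this: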
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
+---------------+---------+---------+--------+----------------------+
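The query should somehow group by agent and find such transactions. However, as the output shows, the result isn't really grouped. My first thought was to use the LEAD function, but I'm not sure. Do you have any ideas?
Idea for the query:
Order by agent and timestamp DESC
Starting from the first row, look at the next row with LEAD?
Check whether the timestamp difference is less than 5 minutes
If so, both rows should be in the output
Continue with the next (second) row
When the second and third rows also meet the criteria, the second row goes into the output a second time, which produces duplicate rows. I don't know how to avoid that yet.
There must be a simpler way, but does this achieve what you're after?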
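-- CTE is defined earlier in the WITH clause (definition not included here);
-- it holds the input rows with `timestamp` as a TIMESTAMP value.
CTE2 AS (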
SELECT customer, agent, value, city, timestamp,
lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)
SELECT agent,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)
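If the goal is the row shape from the question's expected output (one row per transaction, no duplicates), one option is to look at both neighbours of each row and keep the row when either one is close enough. The following is a minimal sketch, not a definitive implementation. It assumes the timestamp is already a TIMESTAMP column, here renamed `ts`; the `transactions` and `neighbours` CTE names and the inline sample rows are illustrative only.

```sql
#standardSQL
WITH transactions AS (  -- hypothetical inline copy of the question's sample data
  SELECT 1 AS customer, 1 AS agent, 106 AS value, 'LA' AS city,
         TIMESTAMP '2019-02-12 15:05:00' AS ts UNION ALL
  SELECT 1, 1, 251, 'LA', TIMESTAMP '2019-02-12 15:06:00' UNION ALL
  SELECT 3, 2, 309, 'NY', TIMESTAMP '2019-02-12 18:41:00' UNION ALL
  SELECT 1, 1, 654, 'LA', TIMESTAMP '2019-02-12 17:12:00'
), neighbours AS (
  -- For every row, fetch the previous and next timestamp of the same agent.
  SELECT *,
    LAG(ts)  OVER (PARTITION BY agent ORDER BY ts) AS prev_ts,
    LEAD(ts) OVER (PARTITION BY agent ORDER BY ts) AS next_ts
  FROM transactions
)
-- A row qualifies if its previous OR its next neighbour is less than 5 minutes away.
-- Each source row is evaluated once, so it can appear at most once in the output.
-- A missing neighbour yields NULL, which does not satisfy the comparison.
SELECT customer, agent, value, city, ts
FROM neighbours
WHERE TIMESTAMP_DIFF(ts, prev_ts, MINUTE) < 5
   OR TIMESTAMP_DIFF(next_ts, ts, MINUTE) < 5
ORDER BY agent, ts
```

On the sample data this keeps the two rows with values 106 and 251. If consecutive transactions must also belong to the same customer, adding `customer` to both `PARTITION BY` clauses should cover that.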
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL
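I may be misunderstanding the data, but can't you simply order by agent and timestamp?
Yes, that could be the first step. After ordering, you have to look at the first row, then at the second row, and check whether the timestamp difference is less than 5 minutes and whether the customer is the same. This should be repeated for all rows.
This looks very good! Thank you very much. The concatenation does make sense. However, it would be better to have separate columns for customer and customer_lead.
Sure. Just keep the existing columns and the _lead columns when timestamp_diff(timestamp_lead, timestamp, MINUTE) < 5.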
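The suggestion in the last comment (separate columns instead of concatenated strings) could look roughly like the sketch below. It is an illustration under assumptions: the timestamp column is renamed `ts`, the `CTE` contents are inline sample rows taken from the question, and the `_lead` column names follow the naming used in the concatenating query.

```sql
#standardSQL
WITH CTE AS (  -- hypothetical inline copy of the question's sample data
  SELECT 1 AS customer, 1 AS agent, 106 AS value, 'LA' AS city,
         TIMESTAMP '2019-02-12 15:05:00' AS ts UNION ALL
  SELECT 1, 1, 251, 'LA', TIMESTAMP '2019-02-12 15:06:00' UNION ALL
  SELECT 3, 2, 309, 'NY', TIMESTAMP '2019-02-12 18:41:00' UNION ALL
  SELECT 1, 1, 654, 'LA', TIMESTAMP '2019-02-12 17:12:00'
), CTE2 AS (
  -- Attach the next transaction of the same agent to every row.
  SELECT customer, agent, value, city, ts,
    LEAD(ts)       OVER (PARTITION BY agent ORDER BY ts) AS ts_lead,
    LEAD(customer) OVER (PARTITION BY agent ORDER BY ts) AS customer_lead,
    LEAD(value)    OVER (PARTITION BY agent ORDER BY ts) AS value_lead,
    LEAD(city)     OVER (PARTITION BY agent ORDER BY ts) AS city_lead
  FROM CTE
)
-- Keep only pairs whose second transaction follows within 5 minutes;
-- rows without a following transaction have a NULL ts_lead and drop out.
SELECT agent,
       customer, customer_lead,
       value, value_lead,
       city, city_lead,
       ts, ts_lead
FROM CTE2
WHERE TIMESTAMP_DIFF(ts_lead, ts, MINUTE) < 5
```

On the sample data this returns a single pair for agent 1: value 106 together with value_lead 251.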
#standardSQL
-- Same query as above, run against the sample data from the question
WITH `project.dataset.table` AS (
SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm'
), temp AS (
SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts
FROM `project.dataset.table`
)
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts
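Result:
+-----+----------+-------+-------+------+-------------------------+---------------+------------+
| Row | customer | agent | value | city | ts                      | next_customer | next_value |
+-----+----------+-------+-------+------+-------------------------+---------------+------------+
| 1   | 1        | 1     | 106   | LA   | 2019-02-12 15:05:00 UTC | 1             | 251        |
+-----+----------+-------+-------+------+-------------------------+---------------+------------+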