Is there another solution, without a WITH clause, for calculating data filtered by date?
Tags: time, google-bigquery
In BigQuery, when date…
+-------+------------+----------+-------+
| Name  | date       | order_id | value |
+-------+------------+----------+-------+
| JONES | 2019-01-03 | 11       | 10    |
| JONES | 2019-01-05 | 12       | 5     |
| JONES | 2019-06-03 | 13       | 3     |
| JONES | 2019-07-03 | 14       | 20    |
| John  | 2019-07-23 | 15       | 10    |
+-------+------------+----------+-------+
My solution is:
WITH data AS (
  SELECT "JONES" name, DATE("2019-01-03") date_time, 11 order_id, 10 value
  UNION ALL
  SELECT "JONES", DATE("2019-01-05"), 12, 5
  UNION ALL
  SELECT "JONES", DATE("2019-06-03"), 13, 3
  UNION ALL
  SELECT "JONES", DATE("2019-07-03"), 14, 20
  UNION ALL
  SELECT "John", DATE("2019-07-23"), 15, 10
),
data2 AS (
  SELECT *, MIN(date_time) OVER (PARTITION BY name) min_date
  FROM data
)
SELECT
  name,
  ARRAY_AGG(STRUCT(order_id AS f_id, date_time AS f_date) ORDER BY order_id LIMIT 1)[OFFSET(0)].*,
  SUM(CASE WHEN date_time < DATE_ADD(min_date, INTERVAL 3 DAY) THEN value END) AS total_value_day3,
  SUM(value) AS total
FROM data2
GROUP BY name
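So my question is: can the same calculation be done in a more efficient way? Or will this solution work on large datasets?

The solution below gets the same result without using window functions or array aggregation, so BQ has to do less sorting/partitioning. For this small example my query takes longer to run, but it shuffles fewer bytes. If you run it on a larger dataset, I think mine will be more efficient.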
WITH data AS (
  SELECT "JONES" name, DATE("2019-01-03") date_time, "11" order_id, 10 value UNION ALL
  SELECT "JONES", DATE("2019-01-05"), "12", 5 UNION ALL
  SELECT "JONES", DATE("2019-06-03"), "13", 3 UNION ALL
  SELECT "JONES", DATE("2019-07-03"), "14", 20 UNION ALL
  SELECT "John", DATE("2019-07-23"), "15", 10
),
aggs as (
  select name, min(date_time) as first_order_date, min(order_id) as first_order_id, sum(value) as total
  from data
  group by 1
)
select
  name,
  first_order_id as f_id,
  first_order_date as f_date,
  sum(value) as total_value_day3,
  total
from aggs
inner join data using(name)
where date_time < date_add(first_order_date, interval 3 day) -- <= perhaps
group by 1,2,3,5
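Note that this assumes order IDs are sequential, i.e. order ID 11 always comes before order ID 12, in the same way that dates are sequential.

I suggest checking how your query behaves on the current dataset. It is hard to predict how this query will scale on a larger dataset, because we don't know the table structure and/or the indexes defined.

I think this solution is efficient enough. As long as the data is not heavily skewed, that is, there are no names with millions of rows in the window, it should work fine on fairly large datasets.

Quiz: order_id is a string data type here, so do you think 110 will come after 12?!

Answer: hopefully the stringified order id in his CTE is a mistake, and his actual table stores it as an INT. I really doubt that is the case, because in real life such fields are in most cases not just simple integers :o) but who knows? Especially since the OP accepted the answer.

Yes, the type of the order id was a mistake on my side.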
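The ordering question raised in the quiz can be checked with a small query. This is only a sketch: the two sample values below are hypothetical and are not rows from the question's table. It compares MIN over order_id as a STRING with MIN over the same values cast to INT64.

```sql
-- Hypothetical sample values, chosen only to show the ordering difference.
WITH ids AS (
  SELECT "12" AS order_id UNION ALL
  SELECT "110"
)
SELECT
  MIN(order_id) AS min_as_string,            -- strings compare character by character
  MIN(CAST(order_id AS INT64)) AS min_as_int -- integers compare numerically
FROM ids
```

With these values, min_as_string is "110" (because "1" sorts before "2" in the second character) while min_as_int is 12. If order_id really were stored as a string, wrapping it in CAST(order_id AS INT64) inside the aggs CTE would be one way to keep min(order_id) consistent with numeric order, assuming every value is a valid integer.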