SQL operations on a table (aggregation and grouping)
I want to run a daily query in BigQuery that compares the sums of several metrics for yesterday and today. Each row of the dataset has a date, a customer ID (cust_id), and the metrics revenue, cost, and profit.
Suppose today is December 23, 2019. The query should sum revenue, cost, and profit per customer for today (December 23) and yesterday (December 22), and flag a metric as anomalous if the ratio of the two daily sums falls outside the threshold range of 0.5 to 1.5.
The query runs every day and only needs to append its new results. Ideally, the final table lists the two daily sums and the anomaly flag for each customer and metric.
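My main concern is that I can do this for a single metric (revenue), but I am not sure how to apply it to all metrics and make the query more efficient. This is the code I wrote: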
SELECT cust_id,
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
THEN revenue
END) AS sum(yesterday),
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
THEN revenue
END) AS sum(today),
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
THEN revenue
END) / SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
THEN revenue
END) as ratio,
FROM `dataset`
GROUP BY cust_id
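The code gives me:
I apologize in advance if the question is unclear. I am new to this and do not know how to phrase it more precisely.
My suggestion is to put the source data into an Excel pivot table and move the "Values" group to the rows to get the view you want.
If you want to stick with SQL, however, you first need to unpivot the data so that each metric sits on its own row, and then group the intermediate result, like this: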
WITH unpivoted AS
(
SELECT
date
, 'revenue' AS metrics
, SUM( revenue ) AS amount
, cust_id
FROM
`dataset`
GROUP
BY
date
, cust_id
UNION ALL
SELECT
date
, 'cost' AS metrics
, SUM( cost ) AS amount
, cust_id
FROM
`dataset`
GROUP
BY
date
, cust_id
-- add more desired metrics
)
SELECT
CURRENT_DATE() AS date_generated
, cust_id
, metrics
, SUM( CASE WHEN date = CURRENT_DATE() THEN amount END ) AS today
, SUM( CASE WHEN date = DATE_ADD( CURRENT_DATE() , INTERVAL -1 DAY ) THEN amount END ) AS yesterday
...
FROM
unpivoted
WHERE
date >= DATE_ADD( CURRENT_DATE() , INTERVAL -1 DAY )
AND date <= CURRENT_DATE()
-- group by customer and metric only, so both days land on the same row
GROUP
BY
cust_id, metrics
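The `...` in the query above leaves out the ratio and the anomaly flag. As a rough sketch of how those columns could be derived, the script below runs the same unpivot-then-group pattern against an in-memory SQLite database, which only stands in for BigQuery here. The table name daily_metrics, the sample rows, and the :today parameter are made up for illustration. SQLite's date(:today, '-1 day') plays the role of DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY), and NULLIF guards the division much as SAFE_DIVIDE would in BigQuery. The ratio is yesterday divided by today, as in the asker's query.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_metrics (date TEXT, cust_id TEXT, revenue REAL, cost REAL)"
)
# Hypothetical sample rows: two customers, two days.
conn.executemany(
    "INSERT INTO daily_metrics VALUES (?, ?, ?, ?)",
    [
        ("2019-12-22", "A", 100.0, 40.0),
        ("2019-12-23", "A", 120.0, 90.0),
        ("2019-12-22", "B", 50.0, 20.0),
        ("2019-12-23", "B", 10.0, 25.0),
    ],
)

QUERY = """
WITH unpivoted AS (
    -- one row per date, customer and metric
    SELECT date, cust_id, 'revenue' AS metrics, SUM(revenue) AS amount
    FROM daily_metrics GROUP BY date, cust_id
    UNION ALL
    SELECT date, cust_id, 'cost' AS metrics, SUM(cost) AS amount
    FROM daily_metrics GROUP BY date, cust_id
),
paired AS (
    -- conditional aggregation puts both days on the same row
    SELECT cust_id, metrics,
           SUM(CASE WHEN date = :today THEN amount END) AS today,
           SUM(CASE WHEN date = date(:today, '-1 day') THEN amount END) AS yesterday
    FROM unpivoted
    WHERE date BETWEEN date(:today, '-1 day') AND :today
    GROUP BY cust_id, metrics
)
SELECT cust_id, metrics, today, yesterday,
       yesterday / NULLIF(today, 0) AS ratio,
       -- a NULL ratio (missing day or zero divisor) falls through to 'TRUE'
       CASE WHEN yesterday / NULLIF(today, 0) BETWEEN 0.5 AND 1.5
            THEN 'FALSE' ELSE 'TRUE' END AS anomalous
FROM paired
ORDER BY cust_id, metrics
"""

rows = conn.execute(QUERY, {"today": "2019-12-23"}).fetchall()
for row in rows:
    print(row)
```

Passing the date as a parameter instead of calling CURRENT_DATE() keeps the sketch reproducible; in a scheduled BigQuery query the current date would normally be used directly.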
You can aggregate the data first and then use LAG or a join to bring in the previous day's data:
with t as (
select cust_id, date,
sum(revenue) as revenue,
sum(cost) as cost,
sum(profit) as profit
from dataset
where date >= date_add(current_date, interval -1 day)
group by cust_id, date
)
select today.cust_id,
today, yesterday
from t today left join
t yesterday
on yesterday.cust_id = today.cust_id and
yesterday.date = date_add(current_date, interval -1 day)
where today.date = current_date;
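You can unpivot the columns first and then group the result. After that, you may want to use LAG to show a day and the previous day on the same row: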
WITH unpivoted AS
(
SELECT
date,
'revenue' AS metrics,
SUM( revenue ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'cost' AS metrics,
SUM( cost ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'profit' AS metrics,
SUM( profit ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
)
SELECT
date as date_generated,
metrics,
cust_id,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) yesterday,
SUM( amount ) AS today,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) as ratio,
CASE WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)<0.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)>1.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) is NULL then 'TRUE'
ELSE 'FALSE'
END as anomalous
FROM
unpivoted
WHERE date >= DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY ) AND date <= DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY )
GROUP BY
date_generated, cust_id, metrics
ORDER BY
date_generated, metrics, cust_id
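Note that the WHERE clause is what limits my solution to today and yesterday. If you widen or remove that filter, the same query compares each day with the previous one across more than two days.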
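One caveat when the window is widened: LAG returns the previous row in the partition, which is the previous calendar day only if no day is missing. The sketch below illustrates one way to guard against that by also lagging the date and comparing it. It again uses an in-memory SQLite database as a stand-in for BigQuery, and the table daily_metrics and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_metrics (date TEXT, cust_id TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_metrics VALUES (?, ?, ?)",
    [
        ("2019-12-20", "A", 100.0),
        ("2019-12-21", "A", 110.0),
        # 2019-12-22 is deliberately missing for customer A
        ("2019-12-23", "A", 120.0),
    ],
)

QUERY = """
WITH daily AS (
    SELECT date, cust_id, SUM(revenue) AS amount
    FROM daily_metrics
    GROUP BY date, cust_id
),
lagged AS (
    SELECT date, cust_id, amount AS today,
           LAG(amount) OVER (PARTITION BY cust_id ORDER BY date) AS prev_amount,
           LAG(date)   OVER (PARTITION BY cust_id ORDER BY date) AS prev_date
    FROM daily
)
SELECT date, cust_id, today,
       -- keep the lagged value only if it really belongs to the day before
       CASE WHEN prev_date = date(date, '-1 day') THEN prev_amount END AS yesterday
FROM lagged
ORDER BY cust_id, date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With the gap on December 22, the row for December 23 gets no value for yesterday instead of silently picking up the sum from December 21.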