SQL: how to loop over four consecutive weeks across the last six months of data in BigQuery
For example, I have the following table in BigQuery (partitioned by date). I need to write a standard SQL query for the problem described in detail below.
student_id date duration(in hours)
1 2020-05-10 7
2 2020-05-10 8
3 2020-05-10 8
1 2020-05-11 8
2 2020-05-11 7
3 2020-05-12 6
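We add data to this table almost every day, so it grows very quickly.
I need to find the student IDs whose attendance was more than 7 hours for four consecutive weeks within the last six months (checked once a day, with the four-week window moving forward one week at a time across those months), and then convert the student type to "good student". For example, in a programming language: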
for(start week->1 - end_week-> 4 till last six months):
    if duration >=7 for date
        boolean true
    start_week = 2 //start week is incremented by 1 week for next loop
    end_week = 5
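For any student, if the duration is greater than or equal to 7 hours in any four consecutive weeks of the last six months, he or she is a good student. This seems quite challenging to me, because my BigQuery and MySQL skills are only average, and I don't know how to approach it.

If I understand correctly, truncate the dates to weeks and aggregate. Then use window functions to compute the flags you need and filter: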
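select t.*
from (select student_id, date_trunc(date, week) as wk, sum(duration) as dur,
             -- smallest weekly total among the current week and the three weeks before it
             min(sum(duration)) over (partition by student_id
                                      order by unix_date(min(date_trunc(date, week)))
                                      range between 21 preceding and current row
                                     ) as min_4week_dur,
             -- first week in which the student has any data
             min(min(date_trunc(date, week))) over (partition by student_id) as min_wk
      from t
      group by 1, 2
     ) t
where date_diff(wk, min_wk, week) >= 3 and  -- only from the student's fourth week onwards
      min_4week_dur >= 7;                   -- the question's pseudocode uses >= 7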
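The two key ideas are:
- min_4week_dur is the minimum weekly duration over a running four-week period.
- A student only becomes eligible in or after their fourth week.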
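To check the SQL by hand on a small dataset, the weekly logic can be mirrored in plain Python. This is only a sketch under stated assumptions: weeks start on Sunday (BigQuery's default for the `WEEK` date part), the threshold is `>= 7` hours per week, and "consecutive" is taken strictly, so a week with no rows breaks the run. The names `week_start` and `good_students` and the tuple layout of `rows` are made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Truncate a date to the Sunday that starts its week."""
    # date.weekday() gives Monday=0 ... Sunday=6, so Sunday maps to offset 0.
    return d - timedelta(days=(d.weekday() + 1) % 7)


def good_students(rows, min_hours=7, weeks_needed=4):
    """Return the IDs that have `weeks_needed` consecutive weeks
    with a weekly total of at least `min_hours`.

    `rows` is an iterable of (student_id, date, duration_hours) tuples.
    """
    # Weekly totals per student: {student_id: {week_start: hours}}
    weekly = defaultdict(lambda: defaultdict(float))
    for student_id, day, hours in rows:
        weekly[student_id][week_start(day)] += hours

    result = set()
    for student_id, totals in weekly.items():
        run = 0          # length of the current streak of qualifying weeks
        previous = None  # previous week that had any data
        for wk in sorted(totals):
            adjacent = previous is not None and wk - previous == timedelta(weeks=1)
            if totals[wk] >= min_hours:
                run = run + 1 if (adjacent and run > 0) else 1
            else:
                run = 0
            previous = wk
            if run >= weeks_needed:
                result.add(student_id)
                break
    return result
```

Unlike the `range between 21 preceding and current row` window, which takes the minimum over whichever weeks happen to exist in the frame, this sketch resets the streak when a week is missing. Restricting `rows` to the last six months before calling it is left to the caller.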
Here is an example for your use case:
# Sample data, only to set up the test with your data
with sample as (
select 1 as ID, DATE("2020-05-10") as d, 7 as hour
union all
select 2 as ID, DATE("2020-05-10") as d, 8 as hour
union all
select 3 as ID, DATE("2020-05-10") as d, 8 as hour
union all
select 1 as ID, DATE("2020-05-11") as d, 8 as hour
union all
select 2 as ID, DATE("2020-05-11") as d, 7 as hour
union all
select 3 as ID, DATE("2020-05-12") as d, 6 as hour
),
# Build an array of dates so that missing days are accounted for (important for the sum over the previous 28 days)
date_array as (
select dd from UNNEST(GENERATE_DATE_ARRAY('2020-05-10', '2020-05-15', INTERVAL 1 DAY)) dd
),
# Cross product of the existing IDs and every date in the range
data_grid as (
select distinct ID, dd from sample, date_array
),
# Right outer join so that days missing from the sample data appear as rows with NULL hours
merged_data as (
select data_grid.ID, d, hour, dd
from sample
right outer join data_grid
  on sample.d = data_grid.dd and sample.ID = data_grid.ID
)
# Per ID, sum the current day and the 27 previous days in a sliding window
select ID, dd,
  SUM(hour) OVER (
    PARTITION BY ID
    ORDER BY dd
    ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
  ) AS total_hours
from merged_data
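"find the student IDs whose attendance was more than 7 hours for four consecutive weeks (checked once a day, with the current week advancing by one week across the past months) within the last six months, and convert those students to good students"
Below is for BigQuery Standard SQL: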
#standardSQL
SELECT * EXCEPT(duration_4_weeks, qualify_for_6_month_condition),
IF(qualify_for_6_month_condition AND
MAX(duration_4_weeks) OVER(PARTITION BY student_id) >= 7,
'good student',
NULL
) type
FROM (
SELECT *,
SUM(duration) OVER(
PARTITION BY student_id
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 27 PRECEDING AND CURRENT ROW
) duration_4_weeks,
date > DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) qualify_for_6_month_condition
FROM `project.dataset.table`
)
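The `RANGE BETWEEN 27 PRECEDING AND CURRENT ROW` frame ordered by `UNIX_DATE(date)` sums each row together with every row from the previous 27 calendar days, however many rows that is, so no date grid is needed to cope with missing days. A small Python sketch of the same idea for a single student can help when checking results by hand. The function name `trailing_28_day_totals` and the `(day, hours)` tuple layout are assumptions made for illustration.

```python
from datetime import date, timedelta


def trailing_28_day_totals(entries):
    """For each (day, hours) entry of one student, sum the hours over
    that day and the 27 calendar days before it.

    Days may be missing from `entries`; they simply contribute nothing.
    """
    entries = sorted(entries)
    totals = []
    for day, _ in entries:
        window_start = day - timedelta(days=27)
        # Select by calendar distance, not by row count, like a RANGE frame.
        total = sum(h for d, h in entries if window_start <= d <= day)
        totals.append((day, total))
    return totals
```

By contrast, a `ROWS BETWEEN 27 PRECEDING AND CURRENT ROW` frame counts rows rather than days, which is why the previous answer first fills in the missing dates.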
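Thanks for your reply. One question: the IDs can be many, not a limited number. I mean there could be 100 or more. In that case, do we have to write multiple union all?

No, the union all at the beginning only simulates your source data. In a real database, run a query to get that data!

Thanks, I am trying it.

It does not show any results in the BigQuery console. Why do we use 21 preceding and current row for four weeks instead of 28? If possible, could you please explain the query?

The query uses 21 because four weeks is the three preceding weeks plus the current week. Run the query without the outer where clause to see what the subquery returns.

Thanks. If I comment out -- datediff(min_wk,wk,week)>=3 and use a duration above 300 (the actual per-day duration in the table), the result has some problems with consecutive weeks. There is no row dated 2020-05-03 in the first week; the actual dates for that id start from 2020-05-05.
student_id, wk, dur, min_4week_dur, min_wk
1 2020-05-03 1710 1710 2020-05-03
1 2020-05-10 1400 1400 2020-05-03
1 2020-05-17 2425 1400 2020-05-03
1 2020-05-24 2236 1400 2020-05-03
1 2020-05-31 2309 1400 2020-05-03
1 2020-06-07 1286 1286 2020-05-03
1 2020-09-06 4818348 2020-05-03

@akashkumar . . . If you want consecutive weeks, then use the logic in the query. If you want to look at the most recent 4 weeks of data, then use row_number().

Thanks, it worked with a few modifications for my needs.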