SQL: How to loop over four consecutive weeks for the last six months of data in BigQuery


sql google-bigquery


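For example, I have a table like the one below in BigQuery (partitioned by date). I have to write a standard SQL query for the problem described in detail below.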

student_id     date      duration(in hours)
  1          2020-05-10   7             
  2          2020-05-10   8
  3          2020-05-10   8
  1          2020-05-11   8
  2          2020-05-11   7
  3          2020-05-12   6

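This is a table we add data to almost every day, so it grows very quickly. I have to find the student IDs whose attendance was more than 7 hours for four consecutive weeks within the last six months (checked once a day, with the four-week window moving forward by one week at a time across those months), and convert the student type to good student. For example, in a programming language: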

for(start week->1 - end_week-> 4 till last six months):
      if duration >=7 for date
        boolean true
      start_week = 2 //start week is incremented by 1 week for next loop
      end_week = 5

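For any student, if the duration is greater than or equal to 7 hours for any four consecutive weeks of data within the last six months, he is a good student. This seems very challenging to me, because my BigQuery and MySQL skills are only average. I don't know how to do this.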

If I understand correctly, truncate the date to the week and aggregate. Then use window functions to get the flags you need and filter:

select t.*
from (select student_id, date_trunc(date, week) as wk, sum(duration) as dur,
             -- smallest weekly total among the current week and the three weeks before it
             min(sum(duration)) over (partition by student_id
                                      order by unix_date(min(date_trunc(date, week)))
                                      range between 21 preceding and current row
                                     ) as min_4week_dur,
             -- first week that has any data for this student
             min(min(date_trunc(date, week))) over (partition by student_id) as min_wk
      from t
      group by 1, 2
     ) t
where date_diff(wk, min_wk, week) >= 3 and
      min_4week_dur > 7;  -- use >= 7 if exactly 7 hours should count

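The two key ideas are:

The minimum duration is the smallest weekly duration within a running four-week period
A student can only qualify in the fourth week of their data or later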
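To check the two ideas above on a handful of rows outside BigQuery, here is a minimal plain-Python sketch of the same logic. The names `week_start` and `qualifying_weeks` are made up for this illustration, the threshold uses the `>= 7` reading from the question, and weeks are assumed to start on Sunday, as they do for DATE_TRUNC(date, WEEK). Like the query, it does not verify that all four weeks actually have data.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Sunday on or before d (assumed to match DATE_TRUNC(d, WEEK))."""
    # date.weekday(): Monday = 0 ... Sunday = 6
    return d - timedelta(days=(d.weekday() + 1) % 7)


def qualifying_weeks(rows, threshold=7):
    """rows: iterable of (student_id, date, duration).

    Returns the set of (student_id, week_start) pairs where the smallest
    weekly total in the trailing four-week window meets the threshold.
    """
    # Step 1: truncate to the week and sum the duration per student and week.
    weekly = defaultdict(int)
    for sid, d, dur in rows:
        weekly[(sid, week_start(d))] += dur

    # Step 2: first week with data, per student.
    first_week = {}
    for sid, wk in weekly:
        first_week[sid] = min(first_week.get(sid, wk), wk)

    # Step 3: minimum weekly total over the current week and the 21 days before it.
    result = set()
    for sid, wk in weekly:
        window = [total for (s, w), total in weekly.items()
                  if s == sid and 0 <= (wk - w).days <= 21]
        in_fourth_week_or_later = (wk - first_week[sid]).days >= 21
        if in_fourth_week_or_later and min(window) >= threshold:
            result.add((sid, wk))
    return result


rows = [
    (1, date(2020, 5, 5), 8), (1, date(2020, 5, 12), 7),
    (1, date(2020, 5, 19), 9), (1, date(2020, 5, 26), 7),
    (2, date(2020, 5, 5), 8), (2, date(2020, 5, 12), 5),
    (2, date(2020, 5, 19), 9), (2, date(2020, 5, 26), 7),
]
print(qualifying_weeks(rows))
```

With these made-up rows only student 1 qualifies, and only in the week starting 2020-05-24, because student 2 has a week with 5 hours. Note that a record dated 2020-05-05 lands in the week labelled 2020-05-03, since the label is the Sunday that starts the week.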

# Sample data, only there so the query can be tested; replace it with your own table
with sample as (
  select 1 as ID,  DATE("2020-05-10") as d, 7 as hour
  union all
  select 2 as ID,  DATE("2020-05-10") as d, 8 as hour
  union all
  select 3 as ID,  DATE("2020-05-10") as d, 8 as hour
  union all
  select 1 as ID,  DATE("2020-05-11") as d, 8 as hour
  union all
  select 2 as ID,  DATE("2020-05-11") as d, 7 as hour
  union all
  select 3 as ID,  DATE("2020-05-12") as d, 6 as hour
),
# Every date in the range, so that missing days still get a row (important because the window below counts rows, not days)
date_array as (
  select dd from UNNEST(GENERATE_DATE_ARRAY('2020-05-10', '2020-05-15', INTERVAL 1 DAY)) dd
),
# Cross product of the existing IDs and every date in the range
data_grid as (
  select distinct ID, dd from sample, date_array
),
# Right outer join, so that dates missing from the sample data appear with a NULL hour
merged_data as (
  select data_grid.ID, d, hour, dd
  from sample
  right outer join data_grid on sample.d = data_grid.dd and sample.ID = data_grid.ID
)
# Per ID, sliding sum over the current day and the 27 days before it (28 days in total)
select ID, dd, SUM(hour)
  OVER (
    PARTITION BY ID
    ORDER BY dd
    ROWS BETWEEN 27 PRECEDING AND CURRENT ROW
  ) AS hours_last_28_days
  from merged_data
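The reason for building the date grid first is that a ROWS frame counts rows, so every calendar day needs a row before "27 preceding" means 27 days. A rough plain-Python sketch of the same steps follows; `rolling_28_day_hours` is a hypothetical helper name, and unlike SUM in SQL it reports 0 rather than NULL when a window holds no records.

```python
from datetime import date, timedelta


def rolling_28_day_hours(rows, start, end):
    """rows: iterable of (student_id, date, hours).

    Returns {(student_id, day): hours summed over that day and the 27 days
    before it} for every day from start to end inclusive.
    """
    # Hours per (student, day) as found in the data.
    hours = {}
    for sid, d, h in rows:
        hours[(sid, d)] = hours.get((sid, d), 0) + h

    ids = sorted({sid for sid, _, _ in rows})
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

    out = {}
    for sid in ids:
        # Dense series: one entry per calendar day, 0 where no record exists.
        daily = [hours.get((sid, d), 0) for d in days]
        for i, d in enumerate(days):
            # Current row plus at most 27 preceding rows.
            out[(sid, d)] = sum(daily[max(0, i - 27): i + 1])
    return out


sample = [
    (1, date(2020, 5, 10), 7), (2, date(2020, 5, 10), 8), (3, date(2020, 5, 10), 8),
    (1, date(2020, 5, 11), 8), (2, date(2020, 5, 11), 7), (3, date(2020, 5, 12), 6),
]
totals = rolling_28_day_hours(sample, date(2020, 5, 10), date(2020, 5, 15))
print(totals[(3, date(2020, 5, 12))])
```

The sample rows are the ones from the question. Student 3 has no record on 2020-05-11, yet that day still occupies a slot in the series, which is what the grid and the outer join achieve in the query.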

"student IDs with more than 7 hours"

"within the last six months, and convert the student to a good student"

Below is for BigQuery Standard SQL:

#standardSQL
SELECT * EXCEPT(duration_4_weeks, qualify_for_6_month_condition),
  IF(qualify_for_6_month_condition AND MAX(duration_4_weeks) OVER(PARTITION BY student_id) >= 7,
     'good student', NULL) type
FROM (
  SELECT *,
    SUM(duration) OVER(
      PARTITION BY student_id
      ORDER BY UNIX_DATE(date)
      RANGE BETWEEN 27 PRECEDING AND CURRENT ROW
    ) duration_4_weeks,
    date > DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH) qualify_for_6_month_condition
  FROM `project.dataset.table`
)
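Because this query orders by UNIX_DATE(date) and uses a RANGE frame, the window is measured in days, so no date grid is needed. The plain-Python sketch below mirrors that reading of the query. `label_good_students` is a made-up name, and `cutoff` is an assumed stand-in for DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), passed in by the caller to keep the example deterministic.

```python
from datetime import date


def label_good_students(rows, cutoff, threshold=7):
    """rows: iterable of (student_id, date, duration).

    Returns a list of (student_id, date, duration, type), where type is
    'good student' or None.
    """
    by_student = {}
    for sid, d, dur in rows:
        by_student.setdefault(sid, []).append((d, dur))

    labelled = []
    for sid, recs in by_student.items():
        # Trailing sum per row: the row's own day plus the 27 days before it.
        sums = [sum(dur2 for d2, dur2 in recs if 0 <= (d - d2).days <= 27)
                for d, _ in recs]
        # As in the query, the maximum is taken over all of the student's rows.
        best = max(sums)
        for d, dur in recs:
            in_last_six_months = d > cutoff
            kind = "good student" if in_last_six_months and best >= threshold else None
            labelled.append((sid, d, dur, kind))
    return labelled


rows = [
    (1, date(2020, 5, 10), 7), (1, date(2020, 5, 11), 8),
    (3, date(2020, 5, 10), 8), (3, date(2020, 5, 12), 6),
    (4, date(2019, 1, 1), 9),
]
for row in label_good_students(rows, cutoff=date(2019, 11, 15)):
    print(row)
```

Read this way, the comparison is between a 28-day total and 7 hours, which almost any active student will meet. If the requirement is at least 7 hours in each of the four weeks, the threshold or the aggregation would need to change.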

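Thanks for your reply. One problem is that there can be many IDs rather than a fixed few; I mean there could be 100 or more. In that case, do we have to write many union all clauses?

No, the union all at the start only simulates your source data. Run the query against your real table to get that data!

Thanks, I am trying it. It does not show any results in the BigQuery console.

Why do we use 21 preceding and current row for four weeks rather than 28? If possible, could you explain this query?

The query uses 21 because 4 weeks is the three preceding weeks plus the current week. Run the query without the outer where clause to see what the subquery returns.

Thanks. If I comment out datediff(min_wk, wk, week) >= 3 and use a duration above 300 (the actual per-day durations in the table), the result has some problems with consecutive weeks. The first week is shown as 2020-05-03, but there is no data for that date; the actual dates for that id start from 2020-05-05.

student_id  wk          dur   min_4week_dur  min_wk
1           2020-05-03  1710  1710           2020-05-03
1           2020-05-10  1400  1400           2020-05-03
1           2020-05-17  2425  1400           2020-05-03
1           2020-05-24  2236  1400           2020-05-03
1           2020-05-31  2309  1400           2020-05-03
1           2020-06-07  1286  1286           2020-05-03
1           2020-09-06  4818348              2020-05-03

@akashkumar . . . If you want consecutive weeks, use the logic that is in the query now. If you want the most recent 4 weeks of data, use row_number().

Thanks, I modified it slightly as needed.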
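The "21 rather than 28" point from the comments can be checked with the week labels quoted in the thread. This is only a sketch, and `weeks_in_window` is a hypothetical helper: it lists which weekly rows fall inside a frame reaching 21 days back from the current week.

```python
from datetime import date


def weeks_in_window(week_starts, current, days_back=21):
    """Week labels that lie between days_back days before current and current."""
    return [w for w in sorted(week_starts) if 0 <= (current - w).days <= days_back]


# Week labels (Sundays) taken from the result posted in the comments.
weeks = [
    date(2020, 5, 3), date(2020, 5, 10), date(2020, 5, 17), date(2020, 5, 24),
    date(2020, 5, 31), date(2020, 6, 7), date(2020, 9, 6),
]

print(len(weeks_in_window(weeks, date(2020, 5, 24))))
print(len(weeks_in_window(weeks, date(2020, 9, 6))))
print(len(weeks_in_window(weeks, date(2020, 5, 31), days_back=28)))
```

Going 21 days back from a week label reaches exactly three earlier labels, so the frame holds four weekly rows; 28 days back would take in a fifth. After the gap between 2020-06-07 and 2020-09-06 the frame holds a single row, so counting the rows in the frame is one way to confirm that the four weeks really are consecutive.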