Google BigQuery: getting the last known record for each month

I have a collection of account balances that shows each customer's balance on a given date:

+---------------+---------+------------+
|  customer_id  |  value  | timestamp  |
+---------------+---------+------------+
| 1             |  -500   | 2019-10-12 |
| 1             |  -300   | 2019-10-11 |
| 1             |  -200   | 2019-10-10 |
| 1             |  0      | 2019-10-09 |
| 2             |  200    | 2019-09-10 |
| 1             |  600    | 2019-09-02 |
+---------------+---------+------------+
Note that customer 2's account balance was not updated in October.

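I want to get the last account balance of each month for every customer. If a customer has no balance update in a given month, the last known balance should be carried over into that month. The result should look like this: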

+---------------+---------+------------+
|  customer_id  |  value  | timestamp  |
+---------------+---------+------------+
| 1             |  -500   | 2019-10-12 |
| 2             |  200    | 2019-10-10 |
| 2             |  200    | 2019-09-10 |
| 1             |  600    | 2019-09-02 |
+---------------+---------+------------+

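Because customer 2's balance was updated in September but not in October, we create a copy of the September row with the date changed to October. Any ideas on how to achieve this in BigQuery?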

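The following query should mostly answer your question. It creates a "month-end" record for each customer for every month and picks up the most recent balance: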

with 

-- Generate a set of months
month_begins as (
  select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),

-- Get the month ends
month_ends as (
  select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),

--  Cross Join and group so we get 1 customer record for every month to account for 
--  situations where customer doesn't change balance in a month
user_month_ends as (
  select
    customer_id,
    month_end_date
  from `project.dataset.table`
  cross join month_ends
  group by 1,2
),

--  Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
  select
    customer_id,
    value,
    timestamp,
    month_end_date
  from `project.dataset.table`
  inner join user_month_ends using(customer_id)
  where timestamp <= month_end_date
),

-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
  select
    *,
    row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
  from values_prior_to_month_end
),

-- Finally, select only the most recent record for each customer per month
final as (
  select
    * except(my_row)
  from ordered
  where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
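A few things to note:

- I did not order the results to match your expected result set, and I kept month_end_date in the output to illustrate the concept. You can easily change the ordering and exclude the fields you do not need.
- In the month_begins CTE, I set the range of months to extend into the future, so your result set will also contain the most recent balance for "future months". To make this cleaner, consider changing '2019-12-01' to current_date(), so that the query always returns results up to the end of the current month.
- Your timestamp field looks like a date, so I used date logic. If the underlying field is an actual timestamp, you should be able to apply the same principle with timestamp logic.
- In your expected result set, I am not sure why the second row, for customer 2, has a timestamp of '2019-10-10'. It seems arbitrary, since customer 2 has no second balance record.
- I deliberately split the logic into several CTEs so that I could comment each step more easily. You can certainly combine several steps into one block for a more compact query.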
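
To make the month-end logic of the query above easier to follow, here is a small Python sketch of the same idea outside BigQuery: for every customer and every month end, keep the most recent balance dated on or before that month end. The helper names `month_end` and `last_balance_per_month_end`, the tuple layout, and the choice of two months are illustrative assumptions, not part of the query.

```python
from datetime import date, timedelta


def month_end(year, month):
    """Return the last calendar day of the given month."""
    first_of_next = date(year + month // 12, month % 12 + 1, 1)
    return first_of_next - timedelta(days=1)


def last_balance_per_month_end(rows, month_ends):
    """rows: list of (customer_id, value, day) tuples.

    Returns {(customer_id, month_end): (value, day)} holding, for each
    customer and month end, the latest balance on or before that date.
    """
    result = {}
    for customer_id in sorted({r[0] for r in rows}):
        # One customer's balance history, oldest first.
        history = sorted(
            (r for r in rows if r[0] == customer_id), key=lambda r: r[2]
        )
        for end in month_ends:
            prior = [r for r in history if r[2] <= end]
            # Like the inner join plus WHERE filter in the query:
            # no balance on or before the month end means no output row.
            if prior:
                _, value, day = prior[-1]
                result[(customer_id, end)] = (value, day)
    return result


# Sample rows taken from the question.
rows = [
    (1, -500, date(2019, 10, 12)),
    (1, -300, date(2019, 10, 11)),
    (1, -200, date(2019, 10, 10)),
    (1, 0, date(2019, 10, 9)),
    (2, 200, date(2019, 9, 10)),
    (1, 600, date(2019, 9, 2)),
]
month_ends = [month_end(2019, 9), month_end(2019, 10)]
balances = last_balance_per_month_end(rows, month_ends)
for key in sorted(balances):
    print(key, balances[key])
```

In this sketch customer 2 gets an entry for the October month end that still carries the September balance of 200, which is the carry-over behavior the question asks for.
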
Below is a version for BigQuery Standard SQL:

#standardSQL
WITH customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id, 
  IFNULL(value, LEAD(value) OVER(win)) value,  
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp  
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id, 
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].* 
  FROM `project.dataset.table` 
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
To apply it to the sample data from your question, see the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
  SELECT 1, -300, '2019-10-11' UNION ALL
  SELECT 1, -200, '2019-10-10' UNION ALL
  SELECT 2, 200, '2019-09-10' UNION ALL
  SELECT 2, 100, '2019-08-11' UNION ALL
  SELECT 2, 50, '2019-07-12' UNION ALL
  SELECT 1, 600, '2019-09-02' 
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id, 
  IFNULL(value, LEAD(value) OVER(win)) value,  
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp  
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id, 
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].* 
  FROM `project.dataset.table` 
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id   
The result is:

Row customer_id value   timestamp    
1   1           -500    2019-10-12   
2   2           200     2019-10-10   
3   1           600     2019-09-02   
4   2           200     2019-09-10   
5   1           null    null     
6   2           100     2019-08-11   
7   1           null    null     
8   2           50      2019-07-12   
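
Rows 5 and 7 are null because customer 1 has no balance on or before those months. Note also that `LEAD(...)` with its default offset looks exactly one row back in the window, so a customer with two or more consecutive months without an update would still get null in the later months. The Python sketch below models a carry-forward that survives longer gaps. It illustrates the idea only: the function name `carry_forward`, the month labels, and the extra months without updates are hypothetical.

```python
def carry_forward(months, balances):
    """Fill missing months with the most recent earlier balance.

    months: month labels in ascending order.
    balances: {month: value} for the months that have a record.
    Returns a list of (month, value); value is None until the first
    record appears.
    """
    filled = []
    last_known = None
    for month in months:
        if month in balances:
            last_known = balances[month]
        filled.append((month, last_known))
    return filled


# Hypothetical case: no updates from October through December.
months = ["2019-07", "2019-08", "2019-09", "2019-10", "2019-11", "2019-12"]
customer_2 = {"2019-07": 50, "2019-08": 100, "2019-09": 200}
filled = carry_forward(months, customer_2)
for month, value in filled:
    print(month, value)
```

Here the September balance of 200 is carried through December. In BigQuery, a similar effect is usually obtained with `LAST_VALUE(value IGNORE NULLS)` over a window ordered by month ascending, though that variant should be checked against your own data before relying on it.
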

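Thank you, Mikhail! The query almost works. I checked it against modified sample data, and the problem is that 600 is returned for 2019-07-02, even though that value was recorded on 2019-09-02.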