Google BigQuery: getting the last known record per month in BigQuery
A collection of account balances, showing each customer's account balance on a given date:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
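Note that customer 2's account balance was not updated in October.
I want to get the last account balance of each customer for every month. If a customer's balance was not updated in a given month, the last known balance should be carried over to that month. The result should look like this: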
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
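Because customer 2's balance was not updated in October but was updated in September, we create a copy of the September row with its date changed to October. Any ideas on how to achieve this in BigQuery?

The following query should mostly answer your question by creating a "month end" record for each customer for every month and picking up the latest balance: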
with
-- Generate a set of months
month_begins as (
  select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
  select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
  select
    customer_id,
    month_end_date
  from `project.dataset.table`
  cross join month_ends
  group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
  select
    customer_id,
    value,
    timestamp,
    month_end_date
  from `project.dataset.table`
  inner join user_month_ends using(customer_id)
  where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
  select
    *,
    row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
  from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
  select
    * except(my_row)
  from ordered
  where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
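A few things to note:
I did not order the results to match your expected result set, and I kept the month_end_date column to illustrate the concept. You can easily change the ordering and exclude the fields you don't need.
In the month_begins CTE, I set a range that extends into future months, so your result set will also contain the latest balance for "future months". To make this tidier, consider changing '2019-12-01' to current_date(), so that the query always returns results through the end of the current month.
The timestamp field looks like a date, so I used date logic; if the underlying field is an actual timestamp, you should be able to apply the same principle with timestamp logic.
In your result set, I'm not sure why the second row for customer 2 has a timestamp of '2019-10-10'. It seems arbitrary, since customer 2 has no second balance record.
I deliberately split the logic into several CTEs so that I could comment on each step more easily; you can certainly combine several steps in one block for a more compact query.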
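As a rough sketch of the second and last notes taken together, the query below folds the steps into one grouped select and lets the month range end at current_date(). The balances CTE is an assumption made for illustration: it inlines the question's sample rows so the sketch runs on its own, and the real table would take its place. It has not been checked against the original data.

```sql
#standardSQL
with balances as (
  -- Assumed stand-in for `project.dataset.table`: the question's sample rows
  select 1 as customer_id, -500 as value, date '2019-10-12' as timestamp union all
  select 1, -300, date '2019-10-11' union all
  select 1, -200, date '2019-10-10' union all
  select 1, 0, date '2019-10-09' union all
  select 2, 200, date '2019-09-10' union all
  select 1, 600, date '2019-09-02'
),
-- One month-end date per month, from January 2019 up to the current month
month_ends as (
  select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date
  from unnest(generate_date_array(date '2019-01-01', current_date(), interval 1 month)) as dt
)
select
  customer_id,
  month_end_date,
  -- Keep only the latest balance recorded on or before each month end
  array_agg(struct(value, timestamp) order by timestamp desc limit 1)[offset(0)].*
from balances
cross join month_ends
-- If timestamp were a TIMESTAMP column, date(timestamp) <= month_end_date
-- would be the equivalent comparison
where timestamp <= month_end_date
group by customer_id, month_end_date
order by customer_id, month_end_date desc
```

As in the longer query, months before a customer's first balance produce no row, because the where clause filters out every pairing for them.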
Below is a version for BigQuery Standard SQL:
#standardSQL
WITH customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
  IFNULL(value, LEAD(value) OVER(win)) value,
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
To apply it to the sample data from your question, see the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
  SELECT 1, -300, '2019-10-11' UNION ALL
  SELECT 1, -200, '2019-10-10' UNION ALL
  SELECT 2, 200, '2019-09-10' UNION ALL
  SELECT 2, 100, '2019-08-11' UNION ALL
  SELECT 2, 50, '2019-07-12' UNION ALL
  SELECT 1, 600, '2019-09-02'
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
  IFNULL(value, LEAD(value) OVER(win)) value,
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
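The result is:
+-----+-------------+-------+------------+
| Row | customer_id | value | timestamp  |
+-----+-------------+-------+------------+
| 1   | 1           | -500  | 2019-10-12 |
| 2   | 2           | 200   | 2019-10-10 |
| 3   | 1           | 600   | 2019-09-02 |
| 4   | 2           | 200   | 2019-09-10 |
| 5   | 1           | null  | null       |
| 6   | 2           | 100   | 2019-08-11 |
| 7   | 1           | null  | null       |
| 8   | 2           | 50    | 2019-07-12 |
+-----+-------------+-------+------------+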
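Thank you, Mikhail! The query almost works. Checking your query against modified sample data: the problem is that 600 is returned for 2019-07-02, although that value was recorded on 2019-09-02.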
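The modified sample data mentioned in this comment is not reproduced here, so the following is not a confirmed fix for that case. It is only a sketch of one possible variant of the query above that carries a balance forward with LAST_VALUE(... IGNORE NULLS) instead of LEAD. The sample rows, the monthly_last CTE name, and the recorded_on column are assumptions made for illustration. The output keeps the original recording date next to each month instead of shifting it into the month.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  -- Hypothetical rows: customer 2 has no update after July
  SELECT 1 AS customer_id, -500 AS value, DATE '2019-10-12' AS timestamp UNION ALL
  SELECT 1, 600, DATE '2019-09-02' UNION ALL
  SELECT 2, 50, DATE '2019-07-12'
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) AS min_month,
           DATE_TRUNC(MAX(timestamp), MONTH) AS max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) AS month
), monthly_last AS (
  -- Latest balance per customer within each month that has a record
  SELECT DATE_TRUNC(timestamp, MONTH) AS month, customer_id,
    ARRAY_AGG(STRUCT(value, timestamp AS recorded_on) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
)
SELECT customer_id, month,
  -- Most recent non-null balance from the current month or any earlier month
  LAST_VALUE(value IGNORE NULLS) OVER win AS value,
  LAST_VALUE(recorded_on IGNORE NULLS) OVER win AS recorded_on
FROM months
CROSS JOIN customers
LEFT JOIN monthly_last USING (month, customer_id)
WINDOW win AS (
  PARTITION BY customer_id ORDER BY month
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY customer_id, month DESC
```

Because the window covers only the current and earlier months, a balance cannot appear in a month before it was recorded, and a gap of several months is filled from the last month that had a record. Months before a customer's first record stay null.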