Sql 在BigQuery中,如何用每个用户的线性区间填充不规则缺失的时间序列值?
对于每个用户,,我有数据不规则地缺少时间序列值,我希望使用BigQuery标准SQL以一定的间隔使用线性插值对其进行转换Sql 在BigQuery中,如何用每个用户的线性区间填充不规则缺失的时间序列值?,sql,google-bigquery,interpolation,missing-data,Sql,Google Bigquery,Interpolation,Missing Data,对于每个用户,,我有数据不规则地缺少时间序列值,我希望使用BigQuery标准SQL以一定的间隔使用线性插值对其进行转换 +------+---------------------+-------+ | name | time | value | +------+---------------------+-------+ | Jane | 2020-11-14 09:01:00 | 3 | | Jane | 2020-11-14 09:05:00 |
+------+---------------------+-------+
| name | time | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 | 3 |
| Jane | 2020-11-14 09:05:00 | 5 |
| Jane | 2020-11-14 09:07:00 | 1 |
| Jane | 2020-11-14 09:09:00 | 8 |
| Jane | 2020-11-14 09:10:00 | 4 |
| Kay | 2020-11-14 09:01:00 | 7 |
| Kay | 2020-11-14 09:04:00 | 1 |
| Kay | 2020-11-14 09:05:00 | 10 |
| Kay | 2020-11-14 09:09:00 | 6 |
| Kay | 2020-11-14 09:10:00 | 7 |
+------+---------------------+-------+
我想把它转换成如下:
+------+---------------------+-------+-----------------+
| name | time | value | |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3 | |
| Jane | 2020-11-14 09:02:00 | 3.5 | <= interpolaetd |
| Jane | 2020-11-14 09:03:00 | 4 | <= interpolaetd |
| Jane | 2020-11-14 09:04:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:05:00 | 5 | |
| Jane | 2020-11-14 09:06:00 | 3 | <= interpolaetd |
| Jane | 2020-11-14 09:07:00 | 1 | |
| Jane | 2020-11-14 09:08:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:09:00 | 8 | |
| Jane | 2020-11-14 09:10:00 | 4 | |
| Kay | 2020-11-14 09:01:00 | 7 | |
| Kay | 2020-11-14 09:02:00 | 5 | <= interpolaetd |
| Kay | 2020-11-14 09:03:00 | 3 | <= interpolaetd |
| Kay | 2020-11-14 09:04:00 | 1 | |
| Kay | 2020-11-14 09:05:00 | 10 | |
| Kay | 2020-11-14 09:06:00 | 9 | <= interpolaetd |
| Kay | 2020-11-14 09:07:00 | 8 | <= interpolaetd |
| Kay | 2020-11-14 09:08:00 | 7 | <= interpolaetd |
| Kay | 2020-11-14 09:09:00 | 6 | |
| Kay | 2020-11-14 09:10:00 | 7 | |
+------+---------------------+-------+-----------------+
+------+---------------------+-------+-----------------+
|名称|时间|价值||
+------+---------------------+-------+-----------------+
|简| 2020-11-1409:01:00 | 3 ||
|Jane | 2020-11-14 09:02:00 | 3.5 |这与你之前的问题没有太大区别。从接受的答案开始,您可以执行以下操作:
select name, time,
ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_array(min(time), max(time)) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
这与你之前的问题没有太大区别。从接受的答案开始,您可以执行以下操作:
select name, time,
ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_array(min(time), max(time)) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
下面是BigQuerySQL的示例
#standardSQL
select name, time,
ifnull(value, start_value
+ (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t
using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
如果要应用于问题中的样本数据,则输出为
下面是针对BigQuery SQL的
#standardSQL
select name, time,
ifnull(value, start_value
+ (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t
using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
如果要应用于问题中的样本数据,则输出为