Python: groupby and resample a time series so the date ranges line up
I have a DataFrame that is essentially several time series stacked on top of one another. Each time series has a unique label (group), and the series cover different date ranges:
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
'2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})
df
date group value
0 2010-01-01 1 1
1 2010-01-02 1 2
2 2010-01-03 1 3
3 2010-01-06 1 4
4 2010-01-01 2 5
5 2010-01-03 2 6
I want to resample the data so that every combination of date and group has exactly one row, filling the value with NaN when there is no observation on that day or the date falls outside that group's range. Example output:
date group value
2010-01-01 1 1
2010-01-02 1 2
2010-01-03 1 3
2010-01-04 1 NaN
2010-01-05 1 NaN
2010-01-06 1 4
2010-01-01 2 5
2010-01-02 2 NaN
2010-01-03 2 6
2010-01-04 2 NaN
2010-01-05 2 NaN
2010-01-06 2 NaN
I have a working solution, but I suspect there is a better way. My solution first pivots the data, then unstacks, groups, and resamples. Basically, all that is really needed is a groupby plus a resample whose range is pinned to the max and min of the entire date column, but I can't see how to do that.
df = (df.pivot(index='date', columns='group', values='value')
        .unstack()
        .reset_index()
        .set_index('date')
        .groupby('group').resample('D').asfreq()
        .drop('group', axis=1)
        .reset_index()
        .rename(columns={0: 'value'}))[['date', 'group', 'value']]
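For what it's worth, the "groupby plus a range pinned to the whole date column" idea can also be written directly as a per-group reindex inside apply. A sketch, reusing the example frame from above (the variable name full_range is mine):

```python
import pandas as pd

# Rebuild the question's example frame
date = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                       '2010-01-06', '2010-01-01', '2010-01-03'])
df = pd.DataFrame({'date': date, 'group': [1, 1, 1, 1, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6]})

# The range is pinned to the min/max of the WHOLE date column,
# not each group's own range
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

out = (df.set_index('date')
         .groupby('group')['value']
         .apply(lambda s: s.reindex(full_range))  # NaN where a group has no row
         .rename_axis(['group', 'date'])
         .reset_index())
```

This gives 12 rows (6 dates x 2 groups) with NaN for the missing combinations, already in group-major order.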
The dates are correct now, thanks. I have edited my post to fix my mistake.
Set the index, then use
pandas.MultiIndex.from_product
to generate the Cartesian product of the values. I also use fill_value=0
to fill in the missing values:
d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
[pd.date_range(df.date.min(), df.date.max()), df.group.unique()],
names=d.index.names
)
d.reindex(midx, fill_value=0).reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
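Since the question's expected output uses NaN rather than 0, note that the same reindex works with fill_value simply left out. A self-contained sketch:

```python
import pandas as pd

date = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                       '2010-01-06', '2010-01-01', '2010-01-03'])
df = pd.DataFrame({'date': date, 'group': [1, 1, 1, 1, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6]})

d = df.set_index(['date', 'group'])
midx = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max()), df['group'].unique()],
    names=d.index.names)

# No fill_value: the gaps stay NaN, matching the desired output
out = d.reindex(midx).reset_index()
```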
Or
Another dance we can do is a cleaned-up version of what the OP tried. Again, I use
fill_value=0
to fill in the missing values; we could leave it out to produce NaN instead:
df.set_index(['date', 'group']) \
.unstack(fill_value=0) \
.asfreq('D', fill_value=0) \
.stack().reset_index()
date group value
0 2010-01-01 1 1
1 2010-01-01 2 5
2 2010-01-02 1 2
3 2010-01-02 2 0
4 2010-01-03 1 3
5 2010-01-03 2 6
6 2010-01-04 1 0
7 2010-01-04 2 0
8 2010-01-05 1 0
9 2010-01-05 2 0
10 2010-01-06 1 4
11 2010-01-06 2 0
Or
Another way:
import pandas as pd
from itertools import product
date = pd.to_datetime(pd.Series(['2010-01-01', '2010-01-02', '2010-01-03',
'2010-01-06', '2010-01-01', '2010-01-03']))
group = [1,1,1,1, 2, 2]
value = [1,2,3,4,5,6]
df = pd.DataFrame({'date':date, 'group':group, 'value':value})
dates = pd.date_range(df.date.min(), df.date.max())
groups = df.group.unique()
df = (pd.DataFrame(list(product(dates, groups)), columns=['date', 'group'])
.merge(df, on=['date', 'group'], how='left')
.sort_values(['group', 'date'])
.reset_index(drop=True))
df
# date group value
#0 2010-01-01 1 1.0
#1 2010-01-02 1 2.0
#2 2010-01-03 1 3.0
#3 2010-01-04 1 NaN
#4 2010-01-05 1 NaN
#5 2010-01-06 1 4.0
#6 2010-01-01 2 5.0
#7 2010-01-02 2 NaN
#8 2010-01-03 2 6.0
#9 2010-01-04 2 NaN
#10 2010-01-05 2 NaN
#11 2010-01-06 2 NaN
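A variant of the same merge idea: the date x group scaffold can also be built with pandas' own MultiIndex instead of itertools.product. A sketch (the variable name scaffold is mine):

```python
import pandas as pd

date = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                       '2010-01-06', '2010-01-01', '2010-01-03'])
df = pd.DataFrame({'date': date, 'group': [1, 1, 1, 1, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6]})

# Build the date x group scaffold with MultiIndex.from_product
scaffold = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max()), df['group'].unique()],
    names=['date', 'group']).to_frame(index=False)

# Left-merge the observations onto the scaffold; gaps become NaN
out = (scaffold.merge(df, on=['date', 'group'], how='left')
               .sort_values(['group', 'date'])
               .reset_index(drop=True))
```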
Do the cross product [dates] x [groups], e.g. using
merge
. To get all the "possible dates", you can use pandas.date_range
based on the min/max values in the date
column. Sorry, I changed 0 to NaN! Either form suits my needs, no worries, so I gave you both (-):
df.set_index(['date', 'group']) \
.unstack() \
.asfreq('D') \
.stack(dropna=False).reset_index()
date group value
0 2010-01-01 1 1.0
1 2010-01-01 2 5.0
2 2010-01-02 1 2.0
3 2010-01-02 2 NaN
4 2010-01-03 1 3.0
5 2010-01-03 2 6.0
6 2010-01-04 1 NaN
7 2010-01-04 2 NaN
8 2010-01-05 1 NaN
9 2010-01-05 2 NaN
10 2010-01-06 1 4.0
11 2010-01-06 2 NaN
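One small caveat: stack returns the rows in date-major order (as shown above), while the question's expected output is group-major; a sort fixes that. A sketch of the full pipeline (stack(dropna=False) may emit a deprecation warning on recent pandas, but it still keeps the all-NaN rows):

```python
import pandas as pd

date = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03',
                       '2010-01-06', '2010-01-01', '2010-01-03'])
df = pd.DataFrame({'date': date, 'group': [1, 1, 1, 1, 2, 2],
                   'value': [1, 2, 3, 4, 5, 6]})

out = (df.set_index(['date', 'group'])
         .unstack()                 # groups become columns
         .asfreq('D')               # fill in the missing calendar days
         .stack(dropna=False)       # keep the rows whose value is NaN
         .reset_index()
         .sort_values(['group', 'date'])  # group-major, as in the question
         .reset_index(drop=True))
```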