Python: How to sum values over skipped datetimes based on an IntervalIndex?
Suppose I have two DataFrames, df1 and df2.
In df1:
date value
0 2018-01-23 10:00:00 10
1 2018-01-23 10:05:00 20
2 2018-01-23 10:10:00 30
3 2018-01-23 10:15:00 40
4 2018-01-23 10:20:00 50
In df2:
date value
0 2018-01-23 10:02:00 10
1 2018-01-23 10:03:00 20
2 2018-01-23 10:04:00 30
3 2018-01-23 10:05:00 40
4 2018-01-23 10:16:00 50
5 2018-01-23 10:17:00 60
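First, I build an IntervalIndex (closed on the left, open on the right) from df1.date. For each interval, I need to compute the sum of df2.value and map that sum back onto df1.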
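For reproducibility, the two frames above can be built roughly like this. It is only a sketch and assumes the date columns hold real datetime64 values rather than strings.

```python
import pandas as pd

# df1: one row per 5-minute breakpoint
df1 = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-23 10:00:00", "2018-01-23 10:05:00", "2018-01-23 10:10:00",
        "2018-01-23 10:15:00", "2018-01-23 10:20:00",
    ]),
    "value": [10, 20, 30, 40, 50],
})

# df2: irregular timestamps whose values should be summed per df1 interval
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-23 10:02:00", "2018-01-23 10:03:00", "2018-01-23 10:04:00",
        "2018-01-23 10:05:00", "2018-01-23 10:16:00", "2018-01-23 10:17:00",
    ]),
    "value": [10, 20, 30, 40, 50, 60],
})
```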
Edit:
The code I am using is:
shift_date = df1.date.shift(-1)
shift_date[-1] = df1.date.iloc[-2] + timedelta(minutes=5) #avoid NaT
idx = pd.IntervalIndex.from_arrays(df1.date, shift_date, closed = "left")
df2_sum = df2.loc[idx.get_indexer(df1.date), 'value']
df2_sum = df2_sum.groupby(df2_sum.index).sum()
But this only gives me the values of df1 mapped onto df2.index.
What I am looking for looks like this:
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
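A likely reason the attempt above does not produce these sums: get_indexer is called on df1.date, so every df1 row simply lands in its own interval, and the resulting positions are then used as row labels of df2. Looking up df2.date instead tells you which interval each df2 row belongs to. The sketch below illustrates the difference. The far-future right edge (2100-01-01) for the last interval is an assumption made here to avoid NaT; it is not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.date_range("2018-01-23 10:00", periods=5, freq="5min"),
    "value": [10, 20, 30, 40, 50],
})
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-23 10:02:00", "2018-01-23 10:03:00", "2018-01-23 10:04:00",
        "2018-01-23 10:05:00", "2018-01-23 10:16:00", "2018-01-23 10:17:00",
    ]),
    "value": [10, 20, 30, 40, 50, 60],
})

# Right edges: the next df1 date, with an assumed far-future date for the last row
right = df1["date"].shift(-1).fillna(pd.Timestamp("2100-01-01"))
idx = pd.IntervalIndex.from_arrays(df1["date"], right, closed="left")

# Indexing df1's own dates: each date is the left edge of its own interval
own = idx.get_indexer(df1["date"])
print(own)   # [0 1 2 3 4]

# Indexing df2's dates: position of the df1 interval containing each df2 row
hits = idx.get_indexer(df2["date"])
print(hits)  # [0 0 0 1 3 3]
```

Grouping df2.value by the second array, rather than selecting df2 rows with the first, is what gives per-interval sums.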
First create an IntervalIndex, using fillna to replace the NaT left by shift with some far-future date (such as 2100-01-01):
# pd.datetime is no longer available in recent pandas; pd.Timestamp does the same job
df1.index = pd.IntervalIndex.from_arrays(df1.date,
                                         df1.date.shift(-1).fillna(pd.Timestamp(2100, 1, 1)),
                                         closed="left")
print (df1)
date value
[2018-01-23 10:00:00, 2018-01-23 10:05:00) 2018-01-23 10:00:00 10
[2018-01-23 10:05:00, 2018-01-23 10:10:00) 2018-01-23 10:05:00 20
[2018-01-23 10:10:00, 2018-01-23 10:15:00) 2018-01-23 10:10:00 30
[2018-01-23 10:15:00, 2018-01-23 10:20:00) 2018-01-23 10:15:00 40
[2018-01-23 10:20:00, 2100-01-01) 2018-01-23 10:20:00 50
Then use cut with groupby and aggregate with sum:
# observed=False keeps empty bins, so df3 has one row per df1 interval
df3 = df2.groupby(pd.cut(df2.date, bins=df1.index), observed=False)['value'].sum().rename('df2_value')
print (df3)
date
[2018-01-23 10:00:00, 2018-01-23 10:05:00) 60
[2018-01-23 10:05:00, 2018-01-23 10:10:00) 40
[2018-01-23 10:10:00, 2018-01-23 10:15:00) 0
[2018-01-23 10:15:00, 2018-01-23 10:20:00) 110
[2018-01-23 10:20:00, 2100-01-01) 0
Name: df2_value, dtype: int64
Both objects have the same index, so it can be dropped and the two joined with concat:
df = pd.concat([df1.reset_index(drop=True), df3.reset_index(drop=True)], axis=1)
print (df)
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
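Put together, the steps above might look like the following self-contained sketch. The helper name sum_by_intervals and its far_future parameter are made up for illustration. observed=False is passed explicitly so that empty bins are kept and the sums stay aligned with the rows of df1.

```python
import pandas as pd

def sum_by_intervals(df1, df2, far_future=pd.Timestamp("2100-01-01")):
    """Hypothetical helper: sum df2.value inside [df1.date[i], df1.date[i+1])."""
    # Right edges are the next df1 date; the last one is an assumed far-future date
    right = df1["date"].shift(-1).fillna(far_future)
    bins = pd.IntervalIndex.from_arrays(df1["date"], right, closed="left")

    # Bin df2's dates, then sum per bin; empty bins are kept and sum to 0
    binned = pd.cut(df2["date"], bins=bins)
    sums = df2.groupby(binned, observed=False)["value"].sum()

    out = df1.copy()
    out["df2_value"] = sums.to_numpy()  # one sum per df1 row, in bin order
    return out

df1 = pd.DataFrame({
    "date": pd.date_range("2018-01-23 10:00", periods=5, freq="5min"),
    "value": [10, 20, 30, 40, 50],
})
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-23 10:02:00", "2018-01-23 10:03:00", "2018-01-23 10:04:00",
        "2018-01-23 10:05:00", "2018-01-23 10:16:00", "2018-01-23 10:17:00",
    ]),
    "value": [10, 20, 30, 40, 50, 60],
})

result = sum_by_intervals(df1, df2)
print(result)
```

This version leaves df1's own index untouched, which matters if df1 is reused afterwards.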
A simpler approach:
ii = pd.IntervalIndex.from_breaks(df1['date'], closed='left')
res = df2.groupby(ii.get_indexer(df2['date']))['value'].sum()
df1['df2_value'] = res.reindex(df1.index, fill_value=0)
Resulting df1:
date value df2_value
0 2018-01-23 10:00:00 10 60
1 2018-01-23 10:05:00 20 40
2 2018-01-23 10:10:00 30 0
3 2018-01-23 10:15:00 40 110
4 2018-01-23 10:20:00 50 0
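One thing to keep in mind with from_breaks: five breakpoints give only four intervals, so the last row of df1 has no interval of its own. get_indexer returns -1 for any df2 row at or after the last breakpoint, and reindex then drops those rows. If such rows should count toward the last df1 row, one option is to borrow the far-future right edge from the first answer. A sketch, with an extra df2 row at 10:21 added purely for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.date_range("2018-01-23 10:00", periods=5, freq="5min"),
    "value": [10, 20, 30, 40, 50],
})
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-23 10:02:00", "2018-01-23 10:03:00", "2018-01-23 10:04:00",
        "2018-01-23 10:05:00", "2018-01-23 10:16:00", "2018-01-23 10:17:00",
        "2018-01-23 10:21:00",  # illustrative extra row after the last breakpoint
    ]),
    "value": [10, 20, 30, 40, 50, 60, 70],
})

# One interval per df1 row; the last one is open-ended up to an assumed sentinel date
right = df1["date"].shift(-1).fillna(pd.Timestamp("2100-01-01"))
ii = pd.IntervalIndex.from_arrays(df1["date"], right, closed="left")

# Position of the containing interval for each df2 row (-1 if none matches)
pos = ii.get_indexer(df2["date"])

# Sum per position; positions with no df2 rows are filled with 0,
# and a -1 group (rows before the first breakpoint) would be dropped here
res = df2.groupby(pos)["value"].sum()
df1["df2_value"] = res.reindex(range(len(df1)), fill_value=0).to_numpy()

print(df1["df2_value"].tolist())  # [60, 40, 0, 110, 70]
```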