Python pd.merge_asof: multiple matches per time period?
I am trying to merge two dataframes by time, with multiple matches. I am looking for all instances of df2 whose timestamp falls 7 days or less before endofweek in df1. There may be more than one record that fits this condition, and I want all of the matches, not just the first or last one (which is what pd.merge_asof returns).
I tried:
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
but this gives me:
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 NaT NaN NaN
3 2019-09-07 8679 NaT NaN NaN
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
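I am losing text values here. Is there a way to do a sort of outer join with pd.merge_asof, so that I keep all matching instances of df2 and not just the first or last one?
My ideal result would look like this (assuming the endofweek times are treated as 00:00:00 on that date):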
You should change direction to 'nearest':
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='nearest'
, left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
Out[106]:
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 2019-09-08 14:00:00 1234.0 1234_3
3 2019-09-07 8679 2019-09-07 12:00:00 8679.0 8679_1
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
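Note that rows 2 and 3 above are matched to df2 timestamps that fall after endofweek, because direction='nearest' searches in both directions. The sketch below checks for such rows. The contents of df1 and df2 are an assumption: the thread never shows them, so they are reconstructed here from the outputs posted in it.

```python
import pandas as pd

# Hypothetical inputs, reconstructed from the outputs shown in this thread.
df1 = pd.DataFrame({
    "endofweek": pd.to_datetime(["2019-08-31", "2019-08-31", "2019-09-07",
                                 "2019-09-07", "2019-09-14", "2019-09-14"]),
    "GroupCol": [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-08-30 10:00:00", "2019-08-30 10:30:00",
                                 "2019-09-07 12:00:00", "2019-09-08 14:00:00"]),
    "GroupVal": [1234, 1234, 8679, 1234],
    "TextVal": ["1234_1", "1234_2", "8679_1", "1234_3"],
})

nearest = pd.merge_asof(df1, df2,
                        tolerance=pd.Timedelta("7d"), direction="nearest",
                        left_on="endofweek", right_on="timestamp",
                        left_by="GroupCol", right_by="GroupVal")

# Rows whose matched timestamp is later than endofweek.
# Comparisons against NaT are False, so unmatched rows are excluded.
late = nearest[nearest["timestamp"] > nearest["endofweek"]]
print(late)
```

If every match has to be on or before endofweek, any row that shows up in late falls outside that requirement.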
One approach I tried is to use groupby on one dataframe and then subset the other dataframe inside pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek']).
apply(lambda x: pd.merge_ordered(x, df2[(
(df2['GroupVal']==x.name[0])
&(abs(df2['timestamp']-x.name[1])<=pd.Timedelta('7d')))],
left_on='endofweek', right_on='timestamp')))
merged
endofweek GroupCol timestamp GroupVal TextVal
GroupCol endofweek
1234 2019-08-31 0 NaT NaN 2019-08-30 10:00:00 1234.0 1234_1
1 NaT NaN 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
2019-09-07 0 2019-09-07 1234.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-08 14:00:00 1234.0 1234_3
1 2019-09-14 1234.0 NaT NaN NaN
8679 2019-08-31 0 2019-08-31 8679.0 NaT NaN NaN
2019-09-07 0 2019-09-07 8679.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-07 12:00:00 8679.0 8679_1
1 2019-09-14 8679.0 NaT NaN NaN
merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
                                     .bfill())
merged.reset_index(drop=True, inplace=True)
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234.0 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234.0 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
3 2019-09-07 1234.0 NaT NaN NaN
4 2019-09-14 1234.0 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 1234.0 NaT NaN NaN
6 2019-08-31 8679.0 NaT NaN NaN
7 2019-09-07 8679.0 NaT NaN NaN
8 2019-09-14 8679.0 2019-09-07 12:00:00 8679.0 8679_1
9 2019-09-14 8679.0 NaT NaN NaN
pd.merge_asof only performs a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, I found it more intuitive and faster to run pd.merge_asof on the two data sources in opposite directions, and then do an outer join to combine the results.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta('7d'), direction='backward',
                           left_on='endofweek', right_on='timestamp',
                           left_by='GroupCol', right_by='GroupVal')
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta('7d'), direction='forward',
                            left_on='timestamp', right_on='endofweek',
                            left_by='GroupVal', right_by='GroupCol')
merged = (left_merge.merge(right_merge, how="outer")
          .sort_values(['endofweek', 'GroupCol', 'timestamp'])
          .reset_index(drop=True))
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
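For anyone who wants to run the two-direction merge end to end, here is a minimal self-contained sketch. The contents of df1 and df2 are an assumption: the thread never shows them, so they are reconstructed from the outputs above. Both frames are sorted on their time key first, since merge_asof requires that.

```python
import pandas as pd

# Hypothetical inputs, reconstructed from the outputs shown in this thread.
df1 = pd.DataFrame({
    "endofweek": pd.to_datetime(["2019-08-31", "2019-08-31", "2019-09-07",
                                 "2019-09-07", "2019-09-14", "2019-09-14"]),
    "GroupCol": [1234, 8679, 1234, 8679, 1234, 8679],
}).sort_values("endofweek")
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-08-30 10:00:00", "2019-08-30 10:30:00",
                                 "2019-09-07 12:00:00", "2019-09-08 14:00:00"]),
    "GroupVal": [1234, 1234, 8679, 1234],
    "TextVal": ["1234_1", "1234_2", "8679_1", "1234_3"],
}).sort_values("timestamp")

# Every df1 row, with the latest df2 row at most 7 days before endofweek.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta("7d"), direction="backward",
                           left_on="endofweek", right_on="timestamp",
                           left_by="GroupCol", right_by="GroupVal")

# Every df2 row, with the first endofweek at most 7 days after its timestamp.
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta("7d"), direction="forward",
                            left_on="timestamp", right_on="endofweek",
                            left_by="GroupVal", right_by="GroupCol")

# The outer join on all shared columns keeps rows from both sides
# and collapses the pairs that both merges found.
merged = (left_merge.merge(right_merge, how="outer")
          .sort_values(["endofweek", "GroupCol", "timestamp"])
          .reset_index(drop=True))
print(merged)
```

With these assumed inputs, the second merge is what brings back 1234_1, which the backward merge alone drops in favour of 1234_2.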
In addition, it is much faster than my other answer:
import time
n=1000
start=time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
              .sort_values(['endofweek', 'GroupCol', 'timestamp'])
              .reset_index(drop=True))
end = time.time()
end-start
15.040804386138916
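(Comment on the direction='nearest' suggestion) This does not work for my purposes. endofweek is always the end of a week and should always be greater than or equal to timestamp. I have edited my question to make my desired result clearer.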
# Timing the groupby / merge_ordered approach, for comparison
import time
n=1000
start=time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek']).
              apply(lambda x: pd.merge_ordered(x, df2[(
                  (df2['GroupVal']==x.name[0])
                  &(abs(df2['timestamp']-x.name[1])<=pd.Timedelta('7d')))],
                  left_on='endofweek', right_on='timestamp')))
end = time.time()
end-start
40.72932052612305