Python pd.merge_asof with multiple matches per time period?

python pandas dataframe

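I am trying to merge two dataframes by time with multiple matches. I am looking for all rows of df2 whose timestamp falls 7 days or less before endofweek in df1. More than one record may fit this condition, and I want all of the matches, not just the first or last one (which is what pd.merge_asof returns).

I tried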

pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')

but this gives me

   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
1 2019-08-31      8679                 NaT       NaN     NaN
2 2019-09-07      1234                 NaT       NaN     NaN
3 2019-09-07      8679                 NaT       NaN     NaN
4 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1

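I have lost the text 1234_1. Is there a way to do a kind of outer join with pd.merge_asof, so that I keep all matching instances of df2 and not just the first or last one?

My ideal result keeps every one of those matches (assuming the endofweek times are treated as 00:00:00 on that date).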
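The sample frames df1 and df2 are not reproduced in the post. The sketch below rebuilds data that is consistent with the outputs shown on this page; the exact values are an assumption reconstructed from those tables, not the asker's original code. It then runs the backward merge_asof from the question to show that only the latest match per week survives.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output tables on this page.
df1 = pd.DataFrame({
    "endofweek": pd.to_datetime([
        "2019-08-31", "2019-08-31", "2019-09-07",
        "2019-09-07", "2019-09-14", "2019-09-14",
    ]),
    "GroupCol": [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-08-30 10:00:00", "2019-08-30 10:30:00",
        "2019-09-07 12:00:00", "2019-09-08 14:00:00",
    ]),
    "GroupVal": [1234, 1234, 8679, 1234],
    "TextVal": ["1234_1", "1234_2", "8679_1", "1234_3"],
})

# merge_asof needs both frames sorted by their time key (they already are).
result = pd.merge_asof(
    df1, df2,
    tolerance=pd.Timedelta(days=7), direction="backward",
    left_on="endofweek", right_on="timestamp",
    left_by="GroupCol", right_by="GroupVal",
)
print(result)  # 1234_1 is absent: only the latest match per df1 row is kept
```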

You should change direction to 'nearest':
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='nearest'
              , left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
Out[106]: 
   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
1 2019-08-31      8679                 NaT       NaN     NaN
2 2019-09-07      1234 2019-09-08 14:00:00    1234.0  1234_3
3 2019-09-07      8679 2019-09-07 12:00:00    8679.0  8679_1
4 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1

One approach I tried is to use groupby on one dataframe and then subset the other dataframe inside pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek']).
apply(lambda x: pd.merge_ordered(x, df2[(
(df2['GroupVal']==x.name[0])
&(abs(df2['timestamp']-x.name[1])<=pd.Timedelta('7d')))], 
left_on='endofweek', right_on='timestamp')))

merged

                       endofweek  GroupCol           timestamp  GroupVal TextVal
GroupCol endofweek
1234     2019-08-31 0        NaT       NaN 2019-08-30 10:00:00    1234.0  1234_1
                    1        NaT       NaN 2019-08-30 10:30:00    1234.0  1234_2
                    2 2019-08-31    1234.0                 NaT       NaN     NaN
         2019-09-07 0 2019-09-07    1234.0                 NaT       NaN     NaN
         2019-09-14 0        NaT       NaN 2019-09-08 14:00:00    1234.0  1234_3
                    1 2019-09-14    1234.0                 NaT       NaN     NaN
8679     2019-08-31 0 2019-08-31    8679.0                 NaT       NaN     NaN
         2019-09-07 0 2019-09-07    8679.0                 NaT       NaN     NaN
         2019-09-14 0        NaT       NaN 2019-09-07 12:00:00    8679.0  8679_1
                    1 2019-09-14    8679.0                 NaT       NaN     NaN

merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
.bfill())

merged.reset_index(drop=True, inplace=True)

merged
   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31    1234.0 2019-08-30 10:00:00    1234.0  1234_1
1 2019-08-31    1234.0 2019-08-30 10:30:00    1234.0  1234_2
2 2019-08-31    1234.0                 NaT       NaN     NaN
3 2019-09-07    1234.0                 NaT       NaN     NaN
4 2019-09-14    1234.0 2019-09-08 14:00:00    1234.0  1234_3
5 2019-09-14    1234.0                 NaT       NaN     NaN
6 2019-08-31    8679.0                 NaT       NaN     NaN
7 2019-09-07    8679.0                 NaT       NaN     NaN
8 2019-09-14    8679.0 2019-09-07 12:00:00    8679.0  8679_1
9 2019-09-14    8679.0                 NaT       NaN     NaN

pd.merge_asof only performs a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, I found it more intuitive and faster to run pd.merge_asof on both data sources in opposite directions and then do an outer join to combine the results:
left_merge = pd.merge_asof(df1, df2,
    tolerance=pd.Timedelta('7d'), direction='backward',
    left_on='endofweek', right_on='timestamp',
    left_by='GroupCol', right_by='GroupVal')

right_merge = pd.merge_asof(df2, df1,
    tolerance=pd.Timedelta('7d'), direction='forward',
    left_on='timestamp', right_on='endofweek',
    left_by='GroupVal', right_by='GroupCol')

merged = (left_merge.merge(right_merge, how="outer")
    .sort_values(['endofweek', 'GroupCol', 'timestamp'])
    .reset_index(drop=True))

merged
   endofweek  GroupCol           timestamp  GroupVal TextVal
0 2019-08-31      1234 2019-08-30 10:00:00    1234.0  1234_1
1 2019-08-31      1234 2019-08-30 10:30:00    1234.0  1234_2
2 2019-08-31      8679                 NaT       NaN     NaN
3 2019-09-07      1234                 NaT       NaN     NaN
4 2019-09-07      8679                 NaT       NaN     NaN
5 2019-09-14      1234 2019-09-08 14:00:00    1234.0  1234_3
6 2019-09-14      8679 2019-09-07 12:00:00    8679.0  8679_1
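One way to cross-check a result like this is a plain merge on the group key followed by a filter on the 7-day window. This is a different technique from the two-way merge_asof in this answer, sketched here only as a sanity check. The sample frames are assumed values reconstructed from the tables on this page, and the names pairs, age, in_window and all_matches are illustrative.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output tables on this page.
df1 = pd.DataFrame({
    "endofweek": pd.to_datetime([
        "2019-08-31", "2019-08-31", "2019-09-07",
        "2019-09-07", "2019-09-14", "2019-09-14",
    ]),
    "GroupCol": [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-08-30 10:00:00", "2019-08-30 10:30:00",
        "2019-09-07 12:00:00", "2019-09-08 14:00:00",
    ]),
    "GroupVal": [1234, 1234, 8679, 1234],
    "TextVal": ["1234_1", "1234_2", "8679_1", "1234_3"],
})

# Pair every week with every df2 row of the same group.
pairs = df1.merge(df2, left_on="GroupCol", right_on="GroupVal", how="inner")

# Keep pairs whose timestamp lies 0 to 7 days before endofweek.
age = pairs["endofweek"] - pairs["timestamp"]
in_window = pairs[(age >= pd.Timedelta(0)) & (age <= pd.Timedelta(days=7))]

# Left-join back onto df1 so weeks without any match are kept as NaT/NaN rows.
all_matches = (
    df1.merge(in_window, on=["endofweek", "GroupCol"], how="left")
    .sort_values(["endofweek", "GroupCol", "timestamp"])
    .reset_index(drop=True)
)
print(all_matches)
```

The intermediate pairs frame grows with the product of rows per group, so this check suits small inputs; it is not a replacement for the faster merge_asof approach.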

In addition, the two-way merge_asof approach is much faster than my other answer (the groupby/merge_ordered one):
import time
n=1000
start=time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
        tolerance=pd.Timedelta('7d'), direction='backward',
        left_on='endofweek', right_on='timestamp',
        left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
        tolerance=pd.Timedelta('7d'), direction='forward',
        left_on='timestamp', right_on='endofweek',
        left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
        .sort_values(['endofweek', 'GroupCol', 'timestamp'])
        .reset_index(drop=True))

end = time.time()

end-start
15.040804386138916
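Comment on the direction='nearest' suggestion: This doesn't work for my purposes. endofweek is always the end of a week and should always be greater than or equal to the timestamp. I have edited my question to make my desired result clearer.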
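The objection can be checked directly: with direction='nearest', merge_asof may pick a df2 row whose timestamp falls after endofweek. The sketch below lists those rows. The sample frames are assumed values reconstructed from the tables on this page, and the name too_late is illustrative.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output tables on this page.
df1 = pd.DataFrame({
    "endofweek": pd.to_datetime([
        "2019-08-31", "2019-08-31", "2019-09-07",
        "2019-09-07", "2019-09-14", "2019-09-14",
    ]),
    "GroupCol": [1234, 8679, 1234, 8679, 1234, 8679],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-08-30 10:00:00", "2019-08-30 10:30:00",
        "2019-09-07 12:00:00", "2019-09-08 14:00:00",
    ]),
    "GroupVal": [1234, 1234, 8679, 1234],
    "TextVal": ["1234_1", "1234_2", "8679_1", "1234_3"],
})

nearest = pd.merge_asof(
    df1, df2,
    tolerance=pd.Timedelta(days=7), direction="nearest",
    left_on="endofweek", right_on="timestamp",
    left_by="GroupCol", right_by="GroupVal",
)

# Rows where the matched timestamp is later than endofweek violate the
# "timestamp must not be after endofweek" requirement (NaT compares as False).
too_late = nearest[nearest["timestamp"] > nearest["endofweek"]]
print(too_late[["endofweek", "GroupCol", "timestamp", "TextVal"]])
```

With this data, both 2019-09-07 rows are matched to timestamps after the end of that week, which is why 'nearest' does not satisfy the requirement.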

# Timing of the groupby/merge_ordered approach, for comparison
import time
n=1000
start=time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek']).
    apply(lambda x: pd.merge_ordered(x, df2[(
    (df2['GroupVal']==x.name[0])
    &(abs(df2['timestamp']-x.name[1])<=pd.Timedelta('7d')))], 
    left_on='endofweek', right_on='timestamp')))

end = time.time()

end-start
40.72932052612305