Python: full outer join with pandas.merge_asof


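Hi, I need to align some time series data on the nearest timestamp, so I thought pandas.merge_asof would be a good candidate. However, unlike the standard merge method, it has no option to set how='outer'.

For example:

df1:

df2:

Then, for example, running:

pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest', tolerance=pd.Timedelta('0.3s'))
gives: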

                               Value1  Value2
Time
2020-07-17 14:25:03.535906075     108   222.0
2020-07-17 14:25:05.457247019     110     NaN
2020-07-17 14:25:07.467777014     126    60.0
But what I want is:

                               Value1  Value2
Time
2020-07-17 14:25:03.535906075     108   222.0
2020-07-17 14:25:04.545104980     NaN   150.0   <---- this is the difference
2020-07-17 14:25:05.457247019     110     NaN
2020-07-17 14:25:07.467777014     126    60.0
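
The contents of df1 and df2 are not shown on this page. The following is a minimal sketch that rebuilds two frames consistent with the tables above, so the behaviour can be reproduced. The values and the df1 timestamps are taken from the tables; the first and last timestamps of df2 are assumptions, chosen only so that they fall within 0.3s of a df1 row.

```python
import pandas as pd

# df1: timestamps and values as they appear in the tables above
df1 = pd.DataFrame(
    {"Value1": [108, 110, 126]},
    index=pd.DatetimeIndex(
        ["2020-07-17 14:25:03.535906075",
         "2020-07-17 14:25:05.457247019",
         "2020-07-17 14:25:07.467777014"],
        name="Time"),
)

# df2: only the middle timestamp is known from the desired output;
# the other two are ASSUMED for illustration
df2 = pd.DataFrame(
    {"Value2": [222, 150, 60]},
    index=pd.DatetimeIndex(
        ["2020-07-17 14:25:03.600000000",   # assumed
         "2020-07-17 14:25:04.545104980",
         "2020-07-17 14:25:07.500000000"],  # assumed
        name="Time"),
)

# merge_asof keeps every row of the left frame only, like a left join
result = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                       direction="nearest", tolerance=pd.Timedelta("0.3s"))
print(result)
```

With these assumed timestamps the df2 row at 14:25:04.545104980 has no df1 row within the tolerance, so it is absent from the result, which is the problem described above.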
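  • Unfortunately, pd.merge_asof has no how parameter like pd.merge does; otherwise you could simply pass how='outer'.
  • As a workaround, you can manually append the unmatched rows from the other dataframe.
  • Then sort the result with .sort_index().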


  • This looks simple, but there is no direct solution. One option is to merge a second time to bring in the missing rows:

    import numpy as np
    import pandas as pd

    # enumerate the rows of `df2` to later identify which are missing
    df2 = df2.reset_index().assign(idx=np.arange(df2.shape[0]))
    (pd.merge_asof(df1.reset_index(), 
                   df2[['Time','idx']], 
                  on='Time',
                  direction='nearest', 
                  tolerance=pd.Timedelta('0.3s'))
      .merge(df2, on='idx', how='outer')                        # merge back on row number
      .assign(Time=lambda x: x['Time_x'].fillna(x['Time_y']))   # fill the time
      .set_index(['Time'])                                      # set index back
      .drop(['Time_x','Time_y','idx'], axis=1)
      .sort_index()
    )
    
                                   Value1  Value2
    Time                                         
    2020-07-17 14:25:03.535906075   108.0   222.0
    2020-07-17 14:25:04.545104980     NaN   150.0
    2020-07-17 14:25:05.457247019   110.0     NaN
    2020-07-17 14:25:07.467777014   126.0    60.0
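
The row-numbering idea above could be packaged as a small reusable function. This is only a sketch under stated assumptions: `merge_asof_outer` and the `_pos` column are hypothetical names, both frames are assumed to have a sorted DatetimeIndex with the same name and no overlapping column names, and the sample df2 timestamps for 222 and 60 are invented because the original frames are not shown.

```python
import numpy as np
import pandas as pd

def merge_asof_outer(left, right, **asof_kwargs):
    """Hypothetical helper: index-based merge_asof that also keeps
    the rows of `right` that were not matched to any row of `left`."""
    # Tag each row of `right` with its position
    tagged = right.assign(_pos=np.arange(len(right)))
    matched = pd.merge_asof(left, tagged, left_index=True,
                            right_index=True, **asof_kwargs)
    # Positions that show up in the asof result were matched
    used = matched["_pos"].dropna().astype(int)
    is_unmatched = ~tagged["_pos"].isin(used).to_numpy()
    # Append the unmatched rows of `right` and restore time order
    return pd.concat([matched.drop(columns="_pos"),
                      right[is_unmatched]]).sort_index()

# Sample data: values from the tables above, two df2 timestamps ASSUMED
df1 = pd.DataFrame(
    {"Value1": [108, 110, 126]},
    index=pd.DatetimeIndex(["2020-07-17 14:25:03.535906075",
                            "2020-07-17 14:25:05.457247019",
                            "2020-07-17 14:25:07.467777014"], name="Time"))
df2 = pd.DataFrame(
    {"Value2": [222, 150, 60]},
    index=pd.DatetimeIndex(["2020-07-17 14:25:03.600000000",   # assumed
                            "2020-07-17 14:25:04.545104980",
                            "2020-07-17 14:25:07.500000000"],  # assumed
                           name="Time"))

out = merge_asof_outer(df1, df2, direction="nearest",
                       tolerance=pd.Timedelta("0.3s"))
print(out)
```

Identifying unmatched rows by position rather than by checking for NaN in a value column keeps the function usable when the data itself contains missing values.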
    

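Hi, thanks! What do you think would be a good way to merge more than 2 dataframes? Please see the updated question.
@circle999 That needs a different solution. Could you create a new question and link back to this one? You can copy and paste all the data and add more sample dataframes (say 3 instead of 2). Updating a question like this is generally discouraged.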
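    # merge in both directions, then add the rows of df2 that found no match in df1
    df3 = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest', tolerance=pd.Timedelta('0.3s'))
    df4 = pd.merge_asof(df2, df1, left_index=True, right_index=True, direction='nearest', tolerance=pd.Timedelta('0.3s'))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df5 = pd.concat([df3, df4[df4['Value1'].isnull()]]).sort_index()
    df5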
    Out[1]: 
                                   Value1  Value2
    Time                                         
    2020-07-17 14:25:03.535906075   108.0   222.0
    2020-07-17 14:25:04.545104980     NaN   150.0
    2020-07-17 14:25:05.457247019   110.0     NaN
    2020-07-17 14:25:07.467777014   126.0    60.0
    