Python: vectorizing a left join on time using pandas/numpy

python pandas numpy join


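I have two DataFrames, x and y. My goal is to left join y onto x where x.timestamp falls between y['min'] and y['max'], and to count how many such matches occur.

Applying a lambda function to each row works in this case, but it is very slow (joining a 3-row table to a 70k-row table takes 45 to 60 seconds):

%%time
x['count'] = \
    x.apply(lambda r: len(y.loc[(y['min']<=r['timestamp']) & (y['max']>=r['timestamp'])]), axis=1)

Is there a way to vectorize this join in numpy, or is there any other suggestion to make it run faster (under 5 s)?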

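For a general solution that also handles overlaps, first use a cross join, then filter the rows by the condition, and finally add a new column that counts the matching values:

# cross join: pair every row of x with every row of y through a constant key
df = x.assign(a=1).merge(y.assign(a=1), on='a')
# keep only the pairs whose timestamp lies inside [min, max]
s = df.loc[(df['min']<=df['timestamp']) & (df['max']>=df['timestamp']), 'timestamp']
# count the surviving pairs per timestamp; timestamps without a match get 0
x['count'] = x['timestamp'].map(s.value_counts()).fillna(0).astype(int)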
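
Since the question asks about numpy specifically, the same count can also be sketched with broadcasting, which skips building the merged frame. This is only an illustration and not part of the answer above: the x timestamps are made-up sample values (the question does not show x), and y reuses the three intervals printed further down in this thread. The boolean mask holds len(x) * len(y) entries, so the sketch assumes that product fits in memory.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the values in x are invented for illustration.
x = pd.DataFrame({'timestamp': pd.to_datetime([
    '2013-05-20 00:00:00',
    '2013-06-12 03:08:35',
    '2013-07-01 00:00:00',
    '2014-01-01 00:00:00',
])})
# y uses the three intervals shown in the second answer.
y = pd.DataFrame({
    'min': pd.to_datetime(['2013-05-10 09:10:51', '2013-06-12 03:08:35', '2013-08-03 09:11:35']),
    'max': pd.to_datetime(['2013-06-02 10:27:44', '2013-06-12 03:08:35', '2021-01-26 23:05:17']),
})

ts = x['timestamp'].to_numpy()[:, None]  # shape (len(x), 1)
lo = y['min'].to_numpy()[None, :]        # shape (1, len(y))
hi = y['max'].to_numpy()[None, :]        # shape (1, len(y))

# inside[i, j] is True when x.timestamp[i] lies in [y.min[j], y.max[j]]
inside = (lo <= ts) & (ts <= hi)

# number of matching intervals for each row of x
x['count'] = inside.sum(axis=1)
print(x)
```

Because every pair is tested, overlapping intervals in y are counted correctly, just as with the cross join.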

There is no overlap between the timestamps in x and the min/max timestamps in y. I had to change the first record in the y dataframe:

>>> y
Out[124]: 
                  min                 max
0 2013-05-10 09:10:51 2013-06-02 10:27:44
1 2013-06-12 03:08:35 2013-06-12 03:08:35
2 2013-08-03 09:11:35 2021-01-26 23:05:17

However, if overlap does exist, you can use pd.merge_asof() to do the merge:

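import numpy as np
import pandas as pd
# match each timestamp to the last interval starting at or before it (x sorted by 'timestamp', y by 'min')
foo = pd.merge_asof(x, y, left_on='timestamp', right_on='min', direction='backward')
# a match is valid only if the timestamp is also at or before that interval's 'max'
valid_idx = np.where(foo.timestamp <= foo['max'])[0]
new_cols = foo.loc[valid_idx, :]
foo = pd.merge(x, new_cols, left_index=True, right_index=True, suffixes=('_1', '_2'))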
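
To try the merge_asof idea end to end, here is a minimal self-contained sketch that turns the match into a 0/1 count per row. It rests on some assumptions: the x timestamps are hypothetical sample values, y is the frame printed above, and the intervals in y do not overlap each other, since merge_asof returns at most one candidate interval per timestamp.

```python
import pandas as pd

# Hypothetical sample data: the values in x are invented for illustration.
x = pd.DataFrame({'timestamp': pd.to_datetime([
    '2013-05-20 00:00:00',
    '2013-06-12 03:08:35',
    '2013-07-01 00:00:00',
    '2014-01-01 00:00:00',
])})
y = pd.DataFrame({
    'min': pd.to_datetime(['2013-05-10 09:10:51', '2013-06-12 03:08:35', '2013-08-03 09:11:35']),
    'max': pd.to_datetime(['2013-06-02 10:27:44', '2013-06-12 03:08:35', '2021-01-26 23:05:17']),
})

# merge_asof requires both frames to be sorted on their join keys.
x_sorted = x.sort_values('timestamp').reset_index(drop=True)
y_sorted = y.sort_values('min').reset_index(drop=True)

# For each timestamp, pick the last interval whose 'min' is <= the timestamp.
matched = pd.merge_asof(x_sorted, y_sorted,
                        left_on='timestamp', right_on='min',
                        direction='backward')

# The candidate interval really contains the timestamp only if timestamp <= max.
# Rows with no candidate have NaT in 'max'; the comparison is then False.
matched['count'] = (matched['timestamp'] <= matched['max']).astype(int)
print(matched[['timestamp', 'count']])
```

If the intervals in y can overlap, a timestamp may fall inside several of them, and the cross join from the first answer is the safer choice.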