Python: conditionally extract pandas rows based on another pandas DataFrame
I have two DataFrames:
df1:
col1 col2
1 2
1 3
2 4
df2:
col1
2
3
I want to extract all rows of df1 where df1's col2 is not in df2's col1. So in this case:
col1 col2
2 4
I first tried:
df1[df1['col2'] not in df2['col1']]
But it returns:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Then I tried:
df1[df1['col2'] not in df2['col1'].tolist]
But it returns:
TypeError: argument of type 'instancemethod' is not iterable
You can use isin together with ~ to invert the boolean mask:
print (df1['col2'].isin(df2['col1']))
0 True
1 True
2 False
Name: col2, dtype: bool
print (~df1['col2'].isin(df2['col1']))
0 False
1 False
2 True
Name: col2, dtype: bool
print (df1[~df1['col2'].isin(df2['col1'])])
col1 col2
2 2 4
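For reference, the steps above can be put together as a self-contained sketch. The frames are rebuilt from the sample data in the question; the names mask and result are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# True where df1.col2 appears anywhere in df2.col1.
mask = df1["col2"].isin(df2["col1"])

# ~ flips the mask, keeping only the rows whose col2 is NOT in df2.col1.
result = df1[~mask]
print(result)
```

Note that the surviving row keeps its original index label (2). If a fresh 0-based index is wanted, result.reset_index(drop=True) should do it.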
Timings:
In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop
In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop
Using the query method:
Timings for larger DFs:
In [44]: df1.shape
Out[44]: (30000000, 2)
In [45]: df2.shape
Out[45]: (20000000, 1)
In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1 loop, best of 3: 5.56 s per loop
In [47]: %timeit (df1.query('col2 not in @df2.col1'))
1 loop, best of 3: 5.96 s per loop