Performant way to check multiple columns in a DataFrame against another column (Python, pandas)

Given a df of strings like this:

   id   v0    v1    v2     v3     v4
0   1  '10'   '5'  '10'   '22'   '50'   
1   2  '22'  '23'  '55'   '60'   '50'   
2   3   '8'   '2'  '40'   '80'  '110'  
3   4  '15'  '15'  '25'  '100'  '101'

I need to check whether the values in columns v0 through v4 appear in the ID column of a separate, second DataFrame:

     ID   State 
0   '10'   'TX'
1   '40'   'VT'
2   '3'    'FL'
3   '15'   'CA'

If they do, I want to return the values that are present as a new column:

    v0    v1    v2     v3     v4      matches
0  '10'   '5'  '10'   '22'   '50'   ['10','10']
1  '22'  '23'  '55'   '60'   '50'   ['']
2   '8'   '2'  '40'   '80'  '110'   ['40']
3  '15'  '15'  '25'  '100'  '101'   ['15','15']

I then plan to use df.explode on that column and left join the result to the second DataFrame.

Currently, I am doing this:

def match_finder(some_list):
    good_list = []
    for x in some_list:
        if second_df['ID'].str.contains(x).any():
            good_list.append(x)
            continue
        else:
            pass
    return good_list

df['matches'] = [
    match_finder([df.iloc[x]["v0"], df.iloc[x]["v1"], df.iloc[x]["v2"], df.iloc[x]["v3"], df.iloc[x]["v4"]])
    for x in range(len(df))]

This does not throw an error, but it is very slow.
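A side note on the loop above, apart from its speed: str.contains performs a regex substring search, whereas the stated goal is to find values that are equal to an ID. The sketch below illustrates the difference on the ID values from the question, assuming the cells hold plain strings without the quotes shown in the display.

```python
import pandas as pd

# ID values copied from the second DataFrame in the question.
ids = pd.Series(["10", "40", "3", "15"], name="ID")

# str.contains treats its argument as a regex and matches substrings,
# so "5" counts as present because it occurs inside "15".
substring_hit = ids.str.contains("5").any()

# isin compares whole values, so "5" is not a match.
exact_hit = ids.isin(["5"]).any()

print(substring_hit, exact_hit)  # True False
```

If exact matches are what is wanted, isin avoids both the per-value scan of the ID column and these accidental substring hits.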

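You can use where and isin to select, in one step, all the values in df columns v0 through v4 that are in the second df's ID column, then stack to drop the nan, groupby on level=0 (the original index of df), and agg as list. To add the rows that have no match (like the second row), you can reindex with the original index of df.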

df['matches'] = (df.filter(regex=r'v\d')
                   .where(lambda x: x.isin(second_df['ID'].to_numpy()))
                   .stack()
                   .groupby(level=0).agg(list)
                   .reindex(df.index, fill_value=[])
                )
print(df)
   id    v0    v1    v2     v3     v4       matches
0   1  '10'   '5'  '10'   '22'   '50'  ['10', '10']
1   2  '22'  '23'  '55'   '60'   '50'            []
2   3   '8'   '2'  '40'   '80'  '110'        ['40']
3   4  '15'  '15'  '25'  '100'  '101'  ['15', '15']
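For anyone who wants to run this end to end, here is a self-contained sketch of the same chain on the sample data, with each step named. It makes two assumptions that are not in the original answer: dropna() is called explicitly after stack(), because whether stack() drops NaN by itself appears to depend on the pandas version, and rows without a match are given a fresh empty list in plain Python rather than through fill_value=[]. The variable names (values, hits, long_hits, per_row) are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": ["10", "22", "8", "15"],
    "v1": ["5", "23", "2", "15"],
    "v2": ["10", "55", "40", "25"],
    "v3": ["22", "60", "80", "100"],
    "v4": ["50", "50", "110", "101"],
})
second_df = pd.DataFrame({"ID": ["10", "40", "3", "15"],
                          "State": ["TX", "VT", "FL", "CA"]})

values = df.loc[:, "v0":"v4"]

# Keep cells whose value is an ID; every other cell becomes NaN.
hits = values.where(values.isin(second_df["ID"].to_numpy()))

# Long format indexed by (row, column); drop the NaN cells explicitly.
long_hits = hits.stack().dropna()

# One list per original row index (rows with no hit are absent here).
per_row = long_hits.groupby(level=0).agg(list)

# Bring back the missing rows and give each one its own empty list.
per_row = per_row.reindex(df.index)
df["matches"] = pd.Series(
    [m if isinstance(m, list) else [] for m in per_row],
    index=df.index,
)

print(df["matches"].tolist())
# [['10', '10'], [], ['40'], ['15', '15']]
```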

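Add an example of the second df.

If you need the list, things will be relatively slow. There are many other ways to store the information without storing complex objects in each cell. For example, a very efficient check is to use isin and then mask the DataFrame values: df1.set_index('id')[df1.set_index('id').isin(df2['ID'].to_numpy())]. This gives the same information but avoids the slow aggregation into a column of lists.

I needed to read up on to_numpy, thanks. Very fast: for a df with more than 200k rows, this completes in 18 seconds.

You can also aggregate along axis 1 after identifying the potential matches with isin: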
v = df.loc[:, 'v0':'v4']
m = v.isin(s['ID'].to_numpy())  # s is the second DataFrame
df['matches'] = v[m].agg(lambda x: x[x.notna()].tolist(), axis=1)

   id    v0    v1    v2     v3     v4       matches
0   1  '10'   '5'  '10'   '22'   '50'  ['10', '10']
1   2  '22'  '23'  '55'   '60'   '50'            []
2   3   '8'   '2'  '40'   '80'  '110'        ['40']
3   4  '15'  '15'  '25'  '100'  '101'  ['15', '15']
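Since the stated next step is an explode followed by a join against the second DataFrame, one possible shortcut is to skip the list column entirely. The sketch below is a different technique from the two answers: it reshapes the value columns to long format with melt and merges directly on the ID values. The names long_df and matched, and the var_name "column", are assumptions made for the example. An inner merge is used so that only matching values remain; how="left" would keep the non-matching values as well, with a missing State.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "v0": ["10", "22", "8", "15"],
    "v1": ["5", "23", "2", "15"],
    "v2": ["10", "55", "40", "25"],
    "v3": ["22", "60", "80", "100"],
    "v4": ["50", "50", "110", "101"],
})
second_df = pd.DataFrame({"ID": ["10", "40", "3", "15"],
                          "State": ["TX", "VT", "FL", "CA"]})

# One row per (id, source column, value) instead of one list per row.
long_df = df.melt(
    id_vars="id",
    value_vars=["v0", "v1", "v2", "v3", "v4"],
    var_name="column",
    value_name="ID",
)

# The inner merge keeps only values found in second_df["ID"]
# and attaches the corresponding State.
matched = (
    long_df.merge(second_df, on="ID", how="inner")
           .sort_values(["id", "column"])
           .reset_index(drop=True)
)

print(matched)
```

On the sample data this leaves five rows: two for id 1, one for id 3, and two for id 4, each carrying the State of the matched ID.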