Python: matching and looking up content across two DataFrames


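I want to check whether the content of one DataFrame also appears in another DataFrame.

The original DataFrame has two columns: ID and the corresponding Fruit. There is a second DataFrame of a different size (both in rows and in columns).

For the original DataFrame, if ID matches ID_1 and the Fruit for that ID appears in the corresponding Content or Content_1 of ID_1, I want to create a new column that indicates it. (The desired output is at the end of this question.)

I tried merging the two DataFrames for further processing. So far I have: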

import pandas as pd

data = {'ID': ["4589", "14805", "23591", "47089", "56251", "85964", "235225", "322624", "342225", "380689", "480562", "5623", "85624", "866278"], 
'Fruit' : ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
}

data_1 = {'ID_1': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Khato Dosh", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "Kaomianjin", "Lai fun", "Lamian", "Liangpi", "who wants Custard Apple", "Misua", "nana Coconut", "Damson", "Paomo", "Ramen", "Rice vermicelli"]
}

df = pd.DataFrame(data)
df = df[['ID', 'Fruit']]

df_1 = pd.DataFrame(data_1)
df_1 = df_1[['ID_1', 'Content', 'Content_1']]

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

for index, row in result.iterrows():
    if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
        print row["ID"] + row["Fruit"]
It gives me TypeError: argument of type 'float' is not iterable

(I am using pandas v0.20.3.)

How can I achieve this? Thank you.

I think you need:

import numpy as np

# swap the DataFrames and use a left join so every row of df_1 is kept
result = df_1.merge(df, left_on = 'ID_1', right_on = 'ID', how = 'left')

# drop NaNs and build one alternation pattern for the substring check;
# the non-capturing group makes the word boundaries apply to every fruit
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))

# boolean mask - the iterrows loop rewritten in a vectorized way;
# the inner parentheses are needed because & binds tighter than |
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))

# remove unnecessary columns
result = result.drop(['ID','Fruit'], axis=1)
# add indicator column
result['matched'] = np.where(mask, 'Y', '')


Old solution with an outer join:

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

# the non-capturing group makes the word boundaries apply to every fruit
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))

# the inner parentheses are needed because & binds tighter than |
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))

result['matched'] = np.where(mask, 'Y', '')
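Note that the pattern above is built from all fruits, so a row counts as matched when any fruit appears in its text. If each row should be checked only against its own Fruit, a row-wise test is one option. The following is a minimal sketch on a shortened copy of the question's data; the helper name fruit_in_row and the reduced sample are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Shortened sample taken from the question's data (assumption: four rows suffice).
df = pd.DataFrame({
    "ID": ["14805", "23591", "85964", "4589"],
    "Fruit": ["Blackberry", "Black Sapote", "Custard Apple", "Avocado"],
})
df_1 = pd.DataFrame({
    "ID_1": ["14805", "23591", "85964", "488"],
    "Content": ["this is Blackberry", "Khara Beruin", "Loha Sura", "Kalo Beruin"],
    "Content_1": ["Kaomianjin", "Lai fun", "who wants Custard Apple", "Jook-sing noodles"],
})

# A left join keeps every row of df; unmatched IDs get missing text columns.
merged = df.merge(df_1, left_on="ID", right_on="ID_1", how="left")


def fruit_in_row(row):
    """Hypothetical helper: True if this row's own Fruit occurs in either text column."""
    texts = (row["Content"], row["Content_1"])
    # isinstance(..., str) skips missing values, which are not strings.
    return any(isinstance(text, str) and row["Fruit"] in text for text in texts)


merged["matched"] = merged.apply(fruit_in_row, axis=1).map({True: "Y", False: ""})
print(merged[["ID", "Fruit", "matched"]])
```

The ID comparison is implicit here: text columns are only filled for rows where the merge found an equal ID_1. A row-wise apply is slower than the vectorized mask, so it is mainly worth it when the per-row comparison matters.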

In some cases, row["Content"] or row["Content_1"] is NaN. NaN is a float, and a float is not iterable; that is why you get the error.
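The error can be reproduced without pandas at all. This small sketch only assumes that the missing values produced by the merge are ordinary float NaN values, which is how pandas represents them in object columns:

```python
nan = float("nan")  # same kind of value the outer merge puts into unmatched cells

# NaN is a float, not a string.
print(type(nan).__name__)

raised = False
try:
    # The `in` operator needs a container or iterable on the right-hand side.
    "Avocado" in nan
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```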

You can use try/except to catch it:

for index, row in result.iterrows():
    try:
        if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
            print( str(row["ID"]) + row["Fruit"])
    except TypeError as e:
        print(e, "for:")
        print(row)
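As an alternative to catching the exception, the missing text can be replaced with empty strings before looping, so that the `in` test always receives a string. This is a sketch under assumptions: the three-row frame stands in for the merged result, and the extra parentheses around the `or` are a deliberate change so that the ID comparison guards both substring checks.

```python
import pandas as pd

# Stand-in for the merged frame (assumption: first row had no match in df_1).
result = pd.DataFrame({
    "ID": ["4589", "14805", "85964"],
    "Fruit": ["Avocado", "Blackberry", "Custard Apple"],
    "ID_1": [None, "14805", "85964"],
    "Content": [None, "this is Blackberry", "Loha Sura"],
    "Content_1": [None, "Kaomianjin", "who wants Custard Apple"],
})

# Replace missing text with "" so that `x in text` never sees a float NaN.
filled = result.fillna({"Content": "", "Content_1": ""})

hits = []
for _, row in filled.iterrows():
    in_text = row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]
    if row["ID"] == row["ID_1"] and in_text:
        hits.append(row["ID"] + " " + row["Fruit"])

print(hits)
```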
I think your merge works fine. To get the specified output, just add a Matched column that checks for NaN values:

import numpy as np

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')
result["Matched"] = np.where(result.isnull().any(axis=1), "N", "Y")

result

        ID            Fruit    ID_1             Content  \
0     4589          Avocado     NaN                 NaN   
1    14805       Blackberry   14805  this is Blackberry   
2    23591     Black Sapote   23591        Khara Beruin   
3    47089  Fingered Citron     NaN                 NaN   
4    56251      Crab Apples   56251               Lapha   
5    85964    Custard Apple   85964           Loha Sura   

                  Content_1 Matched  
0                       NaN       N  
1                Kaomianjin       Y  
2                   Lai fun       Y  
3                       NaN       N  
4                   Liangpi       Y  
5   who wants Custard Apple       Y  

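Hello sir! You are always there to help with DataFrame problems! This has been a perfect learning journey. Have a nice weekend.

@Mark - You're welcome, I just added another solution, which I hope is better.

Wow! The one-liner works! I hope you understand that I accepted the earlier answer, which gave two solutions?

No problem - I think @Jezrael was just more careful and complete. His answer is great!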