Python: matching and finding content across two DataFrames
Tags: python, pandas, dataframe
I want to check whether the content of one DataFrame also appears in another DataFrame.
The original DataFrame has two columns, ID and the corresponding Fruit. There is a second DataFrame of a different size (different numbers of rows and columns).
In the original DataFrame, if ID matches ID_1 and the Fruit for that ID appears in the corresponding Content or Content_1 of ID_1, a new column should be created to indicate it. (The desired output is at the end of this question.)
I tried merging the two DataFrames for further processing. So far I have:
import pandas as pd

data = {'ID': ["4589", "14805", "23591", "47089", "56251", "85964", "235225", "322624", "342225", "380689", "480562", "5623", "85624", "866278"],
        'Fruit': ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
        }
data_1 = {'ID_1': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"],
          'Content': ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Khato Dosh", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
          'Content_1': ["Jook-sing noodles", "Kaomianjin", "Lai fun", "Lamian", "Liangpi", "who wants Custard Apple", "Misua", "nana Coconut", "Damson", "Paomo", "Ramen", "Rice vermicelli"]
          }
df = pd.DataFrame(data)
df = df[['ID', 'Fruit']]
df_1 = pd.DataFrame(data_1)
df_1 = df_1[['ID_1', 'Content', 'Content_1']]
result = df.merge(df_1, left_on='ID', right_on='ID_1', how='outer')
for index, row in result.iterrows():
    # this condition raises the TypeError described below on rows without a match
    if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
        print(row["ID"] + row["Fruit"])
It gives me TypeError: argument of type 'float' is not iterable.
(I am using pandas v0.20.3.)
How can I achieve this? Thank you.
I think you need:
import numpy as np

# swap the DataFrames and use a left join
result = df_1.merge(df, left_on='ID_1', right_on='ID', how='left')
# drop NaNs and build a word-boundary pattern for the substring check;
# the non-capturing group makes \b apply to every alternative
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))
# boolean mask: vectorized rewrite of the iterrows loop;
# the inner parentheses make the ID check apply to both content columns
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))
# remove unnecessary columns
result = result.drop(['ID', 'Fruit'], axis=1)
# add indicator column
result['matched'] = np.where(mask, 'Y', '')
Old solution with an outer join:
import numpy as np

result = df.merge(df_1, left_on='ID', right_on='ID_1', how='outer')
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))
result['matched'] = np.where(mask, 'Y', '')
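One thing to keep in mind with the pattern-based mask: pat is built from every fruit name, so a row is flagged when its content mentions any fruit, not necessarily the fruit that belongs to that row's ID. If the per-row pairing matters, a row-wise check is one option. The following is only a minimal sketch on a small subset of the question's data; the helper name fruit_in_row is hypothetical, and the left join from df is an assumption made here so that every row carries a Fruit value.

```python
import pandas as pd

# Small subset of the question's data, for illustration only.
df = pd.DataFrame({
    "ID": ["14805", "23591", "85964", "4589"],
    "Fruit": ["Blackberry", "Black Sapote", "Custard Apple", "Avocado"],
})
df_1 = pd.DataFrame({
    "ID_1": ["14805", "23591", "85964", "488"],
    "Content": ["this is Blackberry", "Khara Beruin", "Loha Sura", "Kalo Beruin"],
    "Content_1": ["Kaomianjin", "Lai fun", "who wants Custard Apple", "Jook-sing noodles"],
})


def fruit_in_row(row):
    """Hypothetical helper: True if this row's own Fruit occurs in its Content or Content_1."""
    fruit = row["Fruit"]
    if not isinstance(fruit, str):
        return False
    # isinstance guards against NaN (a float) in unmatched rows.
    return any(
        isinstance(row[col], str) and fruit in row[col]
        for col in ("Content", "Content_1")
    )


# A left join keeps the rows of df in their original order;
# unmatched IDs get NaN in the df_1 columns.
result = df.merge(df_1, left_on="ID", right_on="ID_1", how="left")
result["matched"] = result.apply(fruit_in_row, axis=1).map({True: "Y", False: ""})
print(result[["ID", "Fruit", "matched"]])
```

This trades the speed of the vectorized mask for a check that is tied to each row's own fruit; on large frames the apply call will be noticeably slower.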
In some rows, row["Content"] and row["Content_1"] are NaN. NaN is a float, and a float is not iterable, which is why the error occurs. You can use try/except to catch it:
for index, row in result.iterrows():
    try:
        if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
            print(str(row["ID"]) + row["Fruit"])
    except TypeError as e:
        print(e, "for:")
        print(row)
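To make the cause concrete, here is a small sketch in plain Python, without pandas. It reproduces the TypeError and then avoids it with a type check instead of try/except. The helpers contains and is_match and the two sample rows are hypothetical stand-ins for rows of the merged frame. The sketch also parenthesizes the condition, since `A and B or C` is parsed as `(A and B) or C`, which is probably not what the question intends.

```python
nan = float("nan")  # what a missing cell looks like after the outer merge

# Reproduce the failure: `in` needs a container on its right-hand side.
try:
    "Avocado" in nan
    raised = False
except TypeError:
    raised = True


def contains(needle, haystack):
    """NaN-safe substring test: False unless both values are real strings."""
    return isinstance(needle, str) and isinstance(haystack, str) and needle in haystack


def is_match(row):
    # Explicit parentheses: the ID check applies to both content columns.
    return row["ID"] == row["ID_1"] and (
        contains(row["Fruit"], row["Content"])
        or contains(row["Fruit"], row["Content_1"])
    )


# Hypothetical rows shaped like the merged result in the question.
rows = [
    {"ID": "85964", "Fruit": "Custard Apple", "ID_1": "85964",
     "Content": "Loha Sura", "Content_1": "who wants Custard Apple"},
    {"ID": "4589", "Fruit": "Avocado", "ID_1": nan,
     "Content": nan, "Content_1": nan},
]
flags = [is_match(r) for r in rows]
print(raised, flags)
```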
I think your merge works fine. To get the specified output, just add a Matched column that checks for NaN values:
import numpy as np

result = df.merge(df_1, left_on='ID', right_on='ID_1', how='outer')
result["Matched"] = np.where(result.isnull().any(axis=1), "N", "Y")
result
      ID            Fruit   ID_1             Content                Content_1 Matched
0   4589          Avocado    NaN                 NaN                      NaN       N
1  14805       Blackberry  14805  this is Blackberry               Kaomianjin       Y
2  23591     Black Sapote  23591        Khara Beruin                  Lai fun       Y
3  47089  Fingered Citron    NaN                 NaN                      NaN       N
4  56251      Crab Apples  56251               Lapha                  Liangpi       Y
5  85964    Custard Apple  85964           Loha Sura  who wants Custard Apple       Y
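Note that the NaN check marks a row as matched whenever the ID was found in both frames, regardless of content: row 2 above is "Y" although "Black Sapote" does not occur in either content column. If an ID-level flag is what is wanted, the indicator parameter of merge expresses the same idea more directly. The sketch below uses a small subset of the question's data and a left join (an assumption made here to keep the row order of df); it is an illustration, not the answerer's method.

```python
import pandas as pd

# Small subset of the question's data, for illustration only.
df = pd.DataFrame({
    "ID": ["4589", "14805", "23591"],
    "Fruit": ["Avocado", "Blackberry", "Black Sapote"],
})
df_1 = pd.DataFrame({
    "ID_1": ["488", "14805", "23591"],
    "Content": ["Kalo Beruin", "this is Blackberry", "Khara Beruin"],
    "Content_1": ["Jook-sing noodles", "Kaomianjin", "Lai fun"],
})

# indicator=True adds a `_merge` column with the values
# "left_only", "right_only" or "both" for each row.
result = df.merge(df_1, left_on="ID", right_on="ID_1", how="left", indicator=True)

# "both" means the ID exists in df and in df_1; it says nothing about the content.
result["Matched"] = (result["_merge"] == "both").map({True: "Y", False: "N"})
print(result[["ID", "Fruit", "Matched"]])
```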
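Hello, sir! You are always there to help with DataFrame problems! This has been a perfect learning journey. Have a nice weekend!
@Mark - You're welcome. I just added another solution, which I hope is better.
Wow! The one-liner works! I hope you understand that I accepted the earlier answer, which gave two solutions?
No problem - I think @Jezrael was just more careful and complete. His answer is great!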