Python 3.x: fuzzy logic for Excel data processing

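I have two DataFrames: DF (~100k rows), which is the raw data file, and DF1 (15k rows), which is the mapping file. I am trying to match the DF.address and DF.Name columns against DF1.address and DF1.Name. Once a match is found, DF.ID should be filled with DF1.ID if DF1.ID is not None; otherwise DF.ID should be filled with DF1.top_ID.

With the help of fuzzy logic I can match the addresses and names, but I have not been able to join the results I get in order to fill in the ID.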

DF1 (mapping file)

DF (raw data file)


IIUC, here is one solution:

from fuzzywuzzy import fuzz
import pandas as pd

# Read the raw data from the clipboard (copy the raw table first)
raw = pd.read_clipboard()

# Read the mapping data from the clipboard (copy the mapping table first)
mp = pd.read_clipboard()

# Merge the mapping and raw data on the columns that must match exactly.
# Address_x comes from mp (left frame) and Address_y from raw (right frame).
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')

# dfr will contain many candidate rows per address.
# Score each candidate pair with token_sort_ratio on Address_x and Address_y.
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)

# Keep only the highest-scoring row for each Address_x.
# idxmax returns index labels, so select with .loc rather than .iloc.
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()]
# dfr1 should hold the desired result
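
The question also asks for the ID to come from DF1.ID when it is present and from DF1.top_ID otherwise. A minimal sketch of that fallback, assuming the best-match frame carries the mapping file's `ID` and `top_ID` columns under those names (if both files have an `ID` column, the merge would suffix them as `ID_x` and `ID_y` instead). The frame `matched`, the column `filled_ID`, and all values are hypothetical and only stand in for `dfr1`:

```python
import pandas as pd

# Hypothetical stand-in for dfr1: one best-match row per address.
# All values below are invented for illustration only.
matched = pd.DataFrame({
    "Address_x": ["21 E Hollis", "5 Main Rd", "9 Park Ave"],
    "ID": ["A100", None, None],
    "top_ID": ["T1", "T2", None],
})

# Use ID where it is present; otherwise fall back to top_ID.
# Rows where both are missing stay missing.
matched["filled_ID"] = matched["ID"].fillna(matched["top_ID"])

print(matched[["Address_x", "filled_ID"]])
```

`Series.fillna` accepts another Series and aligns on the index, so the fallback is applied row by row without an explicit loop.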

This has sample data for testing the provided solution.

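Why not just do a merge?

For example, 21 E Hollis st in the raw data file should match the address 21 E Hollis in the mapping file. Scenarios like this have to be matched in order to fill in the correct ID, which is why a fuzzy lookup is used here.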
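
To illustrate why an exact merge misses this case, here is a small sketch that compares the two addresses from the comment. It uses the standard library's `difflib` as a rough stand-in for a token-sort ratio, not fuzzywuzzy itself, so the scores will not necessarily equal what `fuzz.token_sort_ratio` returns. The helper name `token_sort_similarity` is hypothetical:

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Approximate token-sort score: lowercase, sort the tokens, then compare."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


raw_address = "21 E Hollis st"
map_address = "21 E Hollis"

# An exact comparison (what a plain merge on the address would do) fails.
print(raw_address == map_address)

# A similarity score still ranks the pair as a close match.
print(token_sort_similarity(raw_address, map_address))
```

Because the tokens are sorted before comparison, word order does not affect the score, which is the property the answer relies on when it picks the highest-scoring candidate.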