Fuzzy matching two columns in the same DataFrame with Python
python, pandas, fuzzywuzzy

I have two datasets in the same DataFrame, each containing a list of companies. One is from 2017 and the other is from this year. I am trying to match the two sets of companies against each other and figured fuzzy matching (fuzzywuzzy) was the best way to do it. Using the partial ratio, I want the columns to hold the following values: last year's company name, the highest fuzzy-match ratio, and this year's company associated with that highest score. The original DataFrame is assigned to the variable "data"; last year's company names are in the "Company" column and this year's are in the "Company name" column. To accomplish this, I tried to write a function around the extractOne fuzzy-matching process and apply it to every value/row in the DataFrame. I would then add the results to the original DataFrame. Here is the code:
import pandas as pd
from fuzzywuzzy import process

names_array = []
ratio_array = []

def match_names(last_year, this_year):
    for row in last_year:
        # best match among this year's names: (name, score)
        x = process.extractOne(row, this_year)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

# last year company names dataset
last_year = data['Company'].dropna().values
# this year company dataset
this_year = data['Company name'].values

name_match, ratio_match = match_names(last_year, this_year)
data['this_year'] = pd.Series(name_match)
data['match_rating'] = pd.Series(ratio_match)
data.to_csv("test.csv")
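However, every time I run this part of the code, the two new columns I created do not show up in the CSV. In fact, "test.csv" is just the same DataFrame as before, even though the computer shows it was created recently. If anyone could point out the problem or help in any way, I would really appreciate it.
Edit (DataFrame preview):
Then, once the "Company" entries (last year's dataset) end, the "Company name" column (this year's dataset) begins like this: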
4168 NaN LEWIS TENNIS
4169 NaN CHUCKS PRO SHOP AT
4170 NaN CHUCK KINYON
4171 NaN LAKE COUNTRY RACQUET CLUB
4172 NaN SPORTS ACADEMY & RAC CLUB
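Your DataFrame structure is odd, given that one column only starts where the other ends, but we can make it work. Let's take the following sample DataFrame for the data you provided: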
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
11 NaN LEWIS TENNIS
12 NaN CHUCKS PRO SHOP AT
13 NaN CHUCK KINYON
14 NaN LAKE COUNTRY RACQUET CLUB
15 NaN SPORTS ACADEMY & RAC CLUB
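For anyone who wants to reproduce the sample, the frame above can be rebuilt roughly as follows. This is only a sketch: the variable names `last_year`, `this_year` and `nan` are made up for illustration, and the values are copied from the preview rather than from the asker's full file.

```python
import pandas as pd

# Values copied from the preview above (not the asker's full dataset).
last_year = [
    "BODYPHLO SPORTIQUE", "JOSEPH A PERRY", "PCH RESORT TENNIS SHOP",
    "GREYSTONE GOLF CLUB INC.", "MUSGROVE COUNTRY CLUB",
    "CITY OF PELHAM RACQUET CLUB", "NORTHRIVER YACHT CLUB", "LAKE FOREST",
    "TNL TENNIS PRO SHOP", "SOUTHERN ATHLETIC CLUB",
    "ORANGE BEACH TENNIS CENTER",
]
this_year = [
    "LEWIS TENNIS", "CHUCKS PRO SHOP AT", "CHUCK KINYON",
    "LAKE COUNTRY RACQUET CLUB", "SPORTS ACADEMY & RAC CLUB",
]

nan = float("nan")

# One column is filled where the other is empty, mirroring the odd layout.
data = pd.DataFrame({
    "Company": last_year + [nan] * len(this_year),
    "Company name": [nan] * len(last_year) + this_year,
})
print(data)
```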
Then perform the matching:
import pandas as pd
from fuzzywuzzy import process, fuzz

known_list = data['Company name'].dropna()

def find_match(x):
    match = process.extractOne(x['Company'], known_list, scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

data[['this year','match_rating']] = data.dropna(subset=['Company']).apply(find_match, axis=1, result_type='expand')
This yields:
Company Company name this year \
0 BODYPHLO SPORTIQUE NaN SPORTS ACADEMY & RAC CLUB
1 JOSEPH A PERRY NaN CHUCKS PRO SHOP AT
2 PCH RESORT TENNIS SHOP NaN LEWIS TENNIS
3 GREYSTONE GOLF CLUB INC. NaN LAKE COUNTRY RACQUET CLUB
4 MUSGROVE COUNTRY CLUB NaN LAKE COUNTRY RACQUET CLUB
5 CITY OF PELHAM RACQUET CLUB NaN LAKE COUNTRY RACQUET CLUB
6 NORTHRIVER YACHT CLUB NaN LAKE COUNTRY RACQUET CLUB
7 LAKE FOREST NaN LAKE COUNTRY RACQUET CLUB
8 TNL TENNIS PRO SHOP NaN LEWIS TENNIS
9 SOUTHERN ATHLETIC CLUB NaN SPORTS ACADEMY & RAC CLUB
10 ORANGE BEACH TENNIS CENTER NaN LEWIS TENNIS
match_rating
0 47.0
1 43.0
2 67.0
3 43.0
4 67.0
5 72.0
6 48.0
7 64.0
8 67.0
9 50.0
10 67.0
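To make the "pick the highest-scoring candidate" idea concrete, here is a minimal standard-library sketch of that step. It is not fuzzywuzzy's implementation: `difflib.SequenceMatcher` stands in for the scorer, so the numbers will differ from the ratings shown above. The helper names `similarity` and `best_match` are hypothetical, and "CHUCK KINYON" is added to the last-year list only to show an exact hit.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score (a stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(query, choices):
    """Score the query against every choice and return (choice, score) for the best one."""
    return max(((c, similarity(query, c)) for c in choices), key=lambda pair: pair[1])

# "CHUCK KINYON" is added here for illustration only; it is not in the
# asker's last-year column.
last_year = ["LAKE FOREST", "TNL TENNIS PRO SHOP", "CHUCK KINYON"]
this_year = ["LEWIS TENNIS", "CHUCKS PRO SHOP AT", "CHUCK KINYON",
             "LAKE COUNTRY RACQUET CLUB"]

# One row per last-year name: (last-year name, best this-year match, score).
results = [(name, *best_match(name, this_year)) for name in last_year]
for name, match, score in results:
    print(f"{name!r} -> {match!r} ({score})")
```

Every last-year name gets some best match, even when nothing is really similar, so in practice a minimum score threshold is usually worth applying before trusting a pairing.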
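Could you include the first 10 rows or so of your DataFrame using data.head(10)?
So, I added a sample of the dataset to the initial question. Logically this makes complete sense, but for some reason when I apply it and send "data" to CSV or try to print "data" in Jupyter, nothing shows up. Is that because of the structure of my DataFrame, or something else?
Your DataFrame structure is odd, but I have edited my answer to fit your case (since I do not have your full DataFrame, the matches in my answer will be nonsense).
Thank you! This is perfect. Thanks for your help.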