Python 如何使用levenshtein函数删除熊猫中的类似值
我有一个数据框,看起来像-Python 如何使用levenshtein函数删除熊猫中的类似值,python,pandas,dataframe,pandas-groupby,Python,Pandas,Dataframe,Pandas Groupby,我有一个数据框,看起来像- ML_ENTITY_NAME EDT_ENTITY_NAME 1 ABC BANK HABIB METROPOLITAN BANK 2 ABC BANK HABIB METROPOLITIAN BANK 3 BANK OF AMERICA HSBC BANK MALAYSIA BHD 4 BANK OF AMERICA HSBC BANK MALAYSIA SDN
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITAN BANK
2 ABC BANK HABIB METROPOLITIAN BANK
3 BANK OF AMERICA HSBC BANK MALAYSIA BHD
4 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
5 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK
6 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
7 CITIBANK N.A. CHINA GUANGFA BANK CO LTD
8 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
9 SECURITY BANK CORP. SECURITY BANK CORP
10 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
11 TEMU ANZ BANK SAMOA LTD
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITIAN BANK
2 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
3 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
4 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
5 SECURITY BANK CORP. SECURITY BANK CORP
6 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
7 TEMU ANZ BANK SAMOA LTD
我写了一个levenshtein函数
def fm(s1, s2):
score = Levenshtein.distance(s1,s2)
if score == 0.0:
score = 1.0
else:
score = 1 - (score / len(s1))
return score
df.sort_values(by=['ML_ENTITY_NAME','EDT_ENTITY_NAME'],inplace=True)
df['delete']=0
for row1 in df.itertuples():
for row2 in df.itertuples():
if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1>fm(str(row1.EDT_ENTITY_NAME),str(row2.EDT_ENTITY_NAME))>.74):
if(len(row1.EDT_ENTITY_NAME)>len(row2.EDT_ENTITY_NAME)):
df.loc[row2.Index,row2[2]]=1
print(df)
我想写一个代码,如果两个EDT_ENTITY_NAME
值的levenstein分数大于.75
,那么我们删除长度较小的一个值,保留长度较大的一个值。同样,用于比较的ML_ENTITY_NAME
应该是相同的
我的最终输出应该是-
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITAN BANK
2 ABC BANK HABIB METROPOLITIAN BANK
3 BANK OF AMERICA HSBC BANK MALAYSIA BHD
4 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
5 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK
6 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
7 CITIBANK N.A. CHINA GUANGFA BANK CO LTD
8 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
9 SECURITY BANK CORP. SECURITY BANK CORP
10 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
11 TEMU ANZ BANK SAMOA LTD
ML_ENTITY_NAME EDT_ENTITY_NAME
1 ABC BANK HABIB METROPOLITIAN BANK
2 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
3 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
4 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
5 SECURITY BANK CORP. SECURITY BANK CORP
6 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
7 TEMU ANZ BANK SAMOA LTD
目前,我的方法是对df进行排序并在循环中迭代,检查ML_ENTITY_NAME值是否相同,然后计算EDT_ENTITY_NAME的levenshtein。我添加了一个新的列delete,如果满足上述条件,并且一个ML_ENTITY_NAME的长度小于另一个ML_ENTITY_NAME,我将把delete列更新为1
我的代码看起来像-
def fm(s1, s2):
score = Levenshtein.distance(s1,s2)
if score == 0.0:
score = 1.0
else:
score = 1 - (score / len(s1))
return score
df.sort_values(by=['ML_ENTITY_NAME','EDT_ENTITY_NAME'],inplace=True)
df['delete']=0
for row1 in df.itertuples():
for row2 in df.itertuples():
if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1>fm(str(row1.EDT_ENTITY_NAME),str(row2.EDT_ENTITY_NAME))>.74):
if(len(row1.EDT_ENTITY_NAME)>len(row2.EDT_ENTITY_NAME)):
df.loc[row2.Index,row2[2]]=1
print(df)
目前它给出了错误的输出
有人能帮我提供一些答案/提示/建议吗?我相信您需要:
#cross join by ML_ENTITY_NAME column
df1 = df.merge(df, on='ML_ENTITY_NAME', how='outer')
#remove same values per rows (distance 1)
df1 = df1[df1['EDT_ENTITY_NAME_x'] != df1['EDT_ENTITY_NAME_y']]
#apply function and compare
m1 = df1.apply(lambda x: fm(x['EDT_ENTITY_NAME_x'], x['EDT_ENTITY_NAME_y']), axis=1) > .75
m2 = df1['EDT_ENTITY_NAME_x'].str.len() > df1['EDT_ENTITY_NAME_y'].str.len()
#filtering
df2 = df1.loc[m1 & m2, ['ML_ENTITY_NAME','EDT_ENTITY_NAME_x']]
#remove `_x`
df2.columns = df2.columns.str.replace('_x$', '')
#add unique rows per ML_ENTITY_NAME
df2 = df2.append(df[~df['ML_ENTITY_NAME'].duplicated(keep=False)]).reset_index(drop=True)
print (df2)
ML_ENTITY_NAME EDT_ENTITY_NAME
0 ABC BANK HABIB METROPOLITIAN BANK
1 BANK OF AMERICA HSBC BANK MALAYSIA SDN BHD
2 BANK OF NEW ZEALAND HUA NAN COMMERCIAL BANK LTD
3 CITIBANK N.A. CHINA GUANGFA BANK CO.,LTD
4 SECURITY BANK CORP. SECURITY BANK CORP
5 SIAM COMMERCIAL BANK THE SIAM COMMERCIAL BANK PCL
6 TEMU ANZ BANK SAMOA LTD
你能具体说明你得到的输出有什么错误吗?我在代码中看到的唯一偏离目标是,对于0.74
delete
标记设置为1,而它应该是0.75另一个需要注意的是,在
fm
函数中不需要if
语句:语句score=1-score/len(s1)
将涵盖这两种情况。这里是ML_ENTITY_NAME和EDT_ENTITY_NAME之间的分数。@MadhurYadav-请现在检查。1。早期,我的排序思想是比较值的第一个字母,如果它们相同,则运行循环。因此,我将节省执行时间。因为我的文件中的变化大多在最后,例如=美国银行、美国银行股份有限公司、美国银行股份有限公司2。我的输出是错误的,因为删除列没有得到更新,并且EDT_ENTITY_NAME值的几列出现在列中。我猜我在同一个df上的O(n^2)循环工作不正常。在我看来,您将第二个索引传递给df.loc
的方式不正确row2[2]
始终是delete
列的值,初始化为0,因此永远不会更改。考虑使用<代码>删除> <代码>:<代码> df.LOC[Ro2.Keal','DeleDe]=1 < /Cord>我尝试过,但我正在显示错误类型错误:元组索引必须是整数,而不是STR。