Python: How to remove similar values in pandas using a Levenshtein function

Tags: python, pandas, dataframe, pandas-groupby

I have a DataFrame that looks like this:

   ML_ENTITY_NAME        EDT_ENTITY_NAME
1  ABC BANK              HABIB METROPOLITAN BANK
2  ABC BANK              HABIB METROPOLITIAN BANK
3  BANK OF AMERICA       HSBC BANK MALAYSIA BHD
4  BANK OF AMERICA       HSBC BANK MALAYSIA SDN BHD
5  BANK OF NEW ZEALAND   HUA NAN COMMERCIAL BANK
6  BANK OF NEW ZEALAND   HUA NAN COMMERCIAL BANK LTD
7  CITIBANK N.A.         CHINA GUANGFA BANK CO LTD
8  CITIBANK N.A.         CHINA GUANGFA BANK CO.,LTD
9  SECURITY BANK CORP.   SECURITY BANK CORP
10 SIAM COMMERCIAL BANK  THE SIAM COMMERCIAL BANK PCL
11 TEMU                  ANZ BANK SAMOA LTD

I have written a Levenshtein function:

import Levenshtein

def fm(s1, s2):
    score = Levenshtein.distance(s1,s2)
    if score == 0.0:
        score = 1.0
    else:
        score = 1 - (score / len(s1))
    return score
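
I want to write code such that, if the Levenshtein score of two EDT_ENTITY_NAME values is greater than .75, the shorter value is removed and the longer one is kept. The two rows being compared must also have the same ML_ENTITY_NAME.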

My final output should be:

   ML_ENTITY_NAME        EDT_ENTITY_NAME
1  ABC BANK              HABIB METROPOLITIAN BANK
2  BANK OF AMERICA       HSBC BANK MALAYSIA SDN BHD
3  BANK OF NEW ZEALAND   HUA NAN COMMERCIAL BANK LTD
4  CITIBANK N.A.         CHINA GUANGFA BANK CO.,LTD
5  SECURITY BANK CORP.   SECURITY BANK CORP
6  SIAM COMMERCIAL BANK  THE SIAM COMMERCIAL BANK PCL
7  TEMU                  ANZ BANK SAMOA LTD
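
Currently, my approach is to sort the df and iterate over it in nested loops, checking whether the ML_ENTITY_NAME values are the same and then computing the Levenshtein score of the EDT_ENTITY_NAME values. I added a new column, delete, and if the condition above is met and one EDT_ENTITY_NAME is shorter than the other, I set delete to 1 for the shorter one.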

My code looks like this:

df.sort_values(by=['ML_ENTITY_NAME','EDT_ENTITY_NAME'],inplace=True)
df['delete']=0
for row1 in df.itertuples():
    for row2 in df.itertuples():
        if (str(row1.ML_ENTITY_NAME) == str(row2.ML_ENTITY_NAME)) and (1>fm(str(row1.EDT_ENTITY_NAME),str(row2.EDT_ENTITY_NAME))>.74):

            if(len(row1.EDT_ENTITY_NAME)>len(row2.EDT_ENTITY_NAME)):
                df.loc[row2.Index,row2[2]]=1
print(df)

Currently it gives the wrong output.

Can anyone help me with some answers, hints, or suggestions?

I believe you need:

import pandas as pd

# self-join on ML_ENTITY_NAME so every EDT_ENTITY_NAME is paired with the others in its group
df1 = df.merge(df, on='ML_ENTITY_NAME', how='outer')
# drop pairs where a value is matched with itself (score 1)
df1 = df1[df1['EDT_ENTITY_NAME_x'] != df1['EDT_ENTITY_NAME_y']]
# apply the function and compare
m1 = df1.apply(lambda x: fm(x['EDT_ENTITY_NAME_x'], x['EDT_ENTITY_NAME_y']), axis=1) > .75
m2 = df1['EDT_ENTITY_NAME_x'].str.len() > df1['EDT_ENTITY_NAME_y'].str.len()

# filtering
df2 = df1.loc[m1 & m2, ['ML_ENTITY_NAME','EDT_ENTITY_NAME_x']]
# remove `_x` (regex=True is needed in pandas 2.0+, where the default is a literal match)
df2.columns = df2.columns.str.replace('_x$', '', regex=True)
# add rows whose ML_ENTITY_NAME occurs only once (DataFrame.append was removed in pandas 2.0)
df2 = pd.concat([df2, df[~df['ML_ENTITY_NAME'].duplicated(keep=False)]]).reset_index(drop=True)
print(df2)
         ML_ENTITY_NAME               EDT_ENTITY_NAME
0              ABC BANK      HABIB METROPOLITIAN BANK
1       BANK OF AMERICA    HSBC BANK MALAYSIA SDN BHD
2   BANK OF NEW ZEALAND   HUA NAN COMMERCIAL BANK LTD
3         CITIBANK N.A.    CHINA GUANGFA BANK CO.,LTD
4   SECURITY BANK CORP.            SECURITY BANK CORP
5  SIAM COMMERCIAL BANK  THE SIAM COMMERCIAL BANK PCL
6                  TEMU            ANZ BANK SAMOA LTD

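Can you specify what is wrong with the output you are getting? The only deviation from the goal that I can see in the code is that the delete flag is set to 1 for row pairs scoring above 0.74, whereas the threshold should be 0.75.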

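On another note, the sort is redundant in your code, because you end up comparing every possible pair of rows anyway. What you may have had in mind when adding the sort was to iterate only over pairs of consecutive rows, which would reduce the number of comparisons from O(n^2) to O(n).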
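
The consecutive-rows idea could be sketched roughly as follows. This is only an illustration under some assumptions: `levenshtein_distance` is a plain-Python stand-in for `Levenshtein.distance` so the snippet runs without the extra package, the sample rows are a subset of the question's data, and the sort itself still costs O(n log n). Comparing only neighbours can also miss similar names that do not end up adjacent after sorting (for example, a name that differs by a leading "THE").

```python
import pandas as pd


def levenshtein_distance(a, b):
    # Plain dynamic-programming edit distance; a stand-in for Levenshtein.distance.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ch_a != ch_b)))  # substitution
        previous = current
    return previous[-1]


def fm(s1, s2):
    # Same score as in the question: 1 - distance / len(s1).
    return 1 - levenshtein_distance(s1, s2) / len(s1)


df = pd.DataFrame({
    "ML_ENTITY_NAME": ["BANK OF AMERICA", "ABC BANK", "TEMU",
                       "ABC BANK", "BANK OF AMERICA"],
    "EDT_ENTITY_NAME": ["HSBC BANK MALAYSIA SDN BHD", "HABIB METROPOLITIAN BANK",
                        "ANZ BANK SAMOA LTD", "HABIB METROPOLITAN BANK",
                        "HSBC BANK MALAYSIA BHD"],
})

# Sorting puts rows of the same group, and similar names within it, next to each other.
df = df.sort_values(["ML_ENTITY_NAME", "EDT_ENTITY_NAME"]).reset_index(drop=True)
df["delete"] = 0

rows = list(df.itertuples())
for prev, cur in zip(rows, rows[1:]):  # one pass over consecutive pairs
    if prev.ML_ENTITY_NAME != cur.ML_ENTITY_NAME:
        continue
    shorter, longer = sorted((prev, cur), key=lambda r: len(r.EDT_ENTITY_NAME))
    if (len(shorter.EDT_ENTITY_NAME) < len(longer.EDT_ENTITY_NAME)
            and fm(longer.EDT_ENTITY_NAME, shorter.EDT_ENTITY_NAME) > 0.75):
        df.loc[shorter.Index, "delete"] = 1  # mark the shorter value for removal

result = (df.loc[df["delete"] == 0, ["ML_ENTITY_NAME", "EDT_ENTITY_NAME"]]
            .reset_index(drop=True))
print(result)
```

Whether neighbour-only comparison is acceptable depends on the data. If the variations are mostly at the end of the names, as described in the comments, sorted neighbours should catch most of them.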


Another thing to note is that the if statement in the fm function is not needed: the statement score = 1 - score / len(s1) covers both cases.
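
A minimal sketch of the simplified function, for illustration. It assumes a plain-Python `levenshtein_distance` helper in place of `Levenshtein.distance` so that it runs without the extra package; the two sample names are taken from the question.

```python
def levenshtein_distance(a, b):
    # Plain dynamic-programming edit distance; a stand-in for Levenshtein.distance.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ch_a != ch_b)))  # substitution
        previous = current
    return previous[-1]


def fm(s1, s2):
    # A distance of 0 already yields 1 - 0 / len(s1) == 1.0, so no branch is needed.
    return 1 - levenshtein_distance(s1, s2) / len(s1)


longer = "HABIB METROPOLITIAN BANK"   # 24 characters
shorter = "HABIB METROPOLITAN BANK"   # 23 characters

score_identical = fm(shorter, shorter)     # distance 0
score_long_first = fm(longer, shorter)     # 1 - 1/24
score_short_first = fm(shorter, longer)    # 1 - 1/23
print(score_identical, score_long_first, score_short_first)
```

Two things are worth keeping in mind. The score is normalised by len(s1) only, so it is not symmetric in its arguments. And an empty s1 raises ZeroDivisionError in this version, whereas the original returned 1.0 when both strings were empty.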

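Here the score is between ML_ENTITY_NAME and EDT_ENTITY_NAME.

@MadhurYadav - please check now.

1. Initially, my idea behind sorting was to compare the first letter of the values and run the loop only if they are the same, which would save execution time, because the variations in my file are mostly at the end, for example: Bank of America, Bank of America Co., Ltd. 2. My output is wrong because the delete column is not getting updated and several EDT_ENTITY_NAME values show up as columns. I guess my O(n^2) loop over the same df is not working properly.

It looks to me like the way you pass the second index to df.loc is incorrect: with itertuples, position 0 of the tuple is the index, so row2[2] is the EDT_ENTITY_NAME value of that row rather than the label of the delete column. The assignment therefore creates a new column named after that value, and the delete column, initialized to 0, never changes. Consider using 'delete' instead: df.loc[row2.Index, 'delete'] = 1

I tried that, but I get the error TypeError: tuple indices must be integers, not str.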
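
To make the indexing point concrete, here is a small sketch on two rows from the question. The length check inside the loop is a hypothetical condition used only to trigger an update; it is not the asker's similarity test. The TypeError quoted above is what Python raises when a tuple is indexed with a string, for example `row2['delete']`, so it likely came from that kind of access rather than from `df.loc` itself.

```python
import pandas as pd

df = pd.DataFrame({
    "ML_ENTITY_NAME": ["ABC BANK", "ABC BANK"],
    "EDT_ENTITY_NAME": ["HABIB METROPOLITAN BANK", "HABIB METROPOLITIAN BANK"],
})
df["delete"] = 0

# itertuples yields namedtuples laid out as
# (Index, ML_ENTITY_NAME, EDT_ENTITY_NAME, delete).
first_row = next(df.itertuples())
position_two = first_row[2]  # the EDT_ENTITY_NAME value, not a column label
print(position_two)

for row in df.itertuples():
    # Hypothetical condition, for illustration only.
    if len(row.EDT_ENTITY_NAME) < 24:
        # Row label from the tuple, column label as a plain string.
        df.loc[row.Index, "delete"] = 1

print(df)
```

Fields of the tuple are read by attribute (`row.Index`, `row.EDT_ENTITY_NAME`) or by integer position, never by string key.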