Python: fuzzy-match a column and merge/join DataFrames


python pandas merge


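I am trying to merge two DataFrames that each have multiple columns, based on matching values in one column of each. This code from @Erfan does a great job of fuzzy-matching the target columns, but is there also a way to carry over the rest of the columns?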

DataFrames

import pandas as pd

df1 = pd.DataFrame({'Key':['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel']})
df2 = pd.DataFrame({'Key':['Aple suce', 'Mango', 'Orag','Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23':['1', '2', '3','4', '5', '6', '7']})

The matching function from @Erfan, as given in the link above

from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the number of matches that will get returned, these are sorted high to low
    """
    # Candidate strings from the right table to match against
    s = df_2[key2].tolist()

    # For each left key, collect up to `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only matches scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

Calling the function

df = fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=1)
df.sort_values(by='Key',ascending=True).reset_index()

Result

index   Key            matches
0       Apple Souce    Aple suce
1       Banana         Bannanna
2       John tabel  
3       Orange  
4       Strawberry     Straw

Expected result

index   Key            matches       Key23
0       Apple Souce    Aple suce     1
1       Banana         Bannanna      6
2       John tabel                   
3       Orange                       
4       Strawberry     Straw         5

For anyone who needs it: I came up with a solution.

merge = pd.merge(df, df2, left_on=['matches'], right_on=['Key'], how='outer').fillna(0)

From there, you can drop the unnecessary or duplicate columns to get a clean result, like this:

clean = merge.drop(['matches', 'Key_y'], axis=1)
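
One thing to watch with the outer merge above: it also keeps the rows of df2 that matched nothing (Mango, Orag and so on), and fillna(0) puts zeros in the empty cells. If the goal is exactly the five rows shown under "Expected result", a left merge may be closer. The following is only a sketch under that assumption. The `matches` values are copied by hand from the "Result" table so that the snippet runs without fuzzywuzzy, and the names `merged`, `clean` and the `'_right'` suffix are my own choices, not part of the original answer.

```python
import pandas as pd

# The fuzzy-matched frame, with values copied from the "Result" table in the question
df = pd.DataFrame({
    'Key': ['Apple Souce', 'Banana', 'John tabel', 'Orange', 'Strawberry'],
    'matches': ['Aple suce', 'Bannanna', '', '', 'Straw'],
})
df2 = pd.DataFrame({'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23': ['1', '2', '3', '4', '5', '6', '7']})

# how='left' keeps exactly the rows of df, in their original order;
# rows whose 'matches' value is not found in df2 get NaN in the df2 columns.
# The suffixes keep the left 'Key' unchanged and rename the right one to 'Key_right'.
merged = df.merge(df2, left_on='matches', right_on='Key',
                  how='left', suffixes=('', '_right'))

# Drop the duplicate key column from df2 and blank out the missing values
clean = merged.drop(columns=['Key_right'])
clean['Key23'] = clean['Key23'].fillna('')

print(clean)
```

This relies on `limit=1`, so that each `matches` cell holds at most one string. With a higher limit the function joins several matches with ', ' and the cell would no longer equal any single value in df2['Key'].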

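Welcome to Stack Overflow! Could you please make sure your question follows the guidelines? Specifically, please state exactly what you have already tried and what you are trying to accomplish.

@Sophos Thanks! I just updated the post.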