Handling multiple similar entries in a Python pandas DataFrame (Python, Pandas, Dataframe, Fuzzywuzzy)


python pandas dataframe


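I have a DataFrame with a "Name" column. It contains multiple similar entries that differ by small inconsistencies, and I want to merge them into one. I am a beginner in data analysis and have come across the fuzzywuzzy module. I tried the following: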

import fuzzywuzzy.fuzz
import fuzzywuzzy.process

names = list(data['Name'].unique())

def replace_matches(df, column, matching_string, min_ratio = 90):

    strings = df[column].unique()
    for i in matching_string:
        matches = fuzzywuzzy.process.extract(i, strings, limit= 5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
        close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]
        matched_rows = df[column].isin(close_matches)
        df.loc[matched_rows, column] = matching_string
    return df

I call the function like this:

replace_matches(df = data, column = 'Name', matching_string = names)

But it raises ValueError: Must have equal len keys and value when setting with an iterable

What is wrong here? Is there a more efficient way to check a column for all similar entries?

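How do you want to merge them? Do you want a DataFrame with only unique values in the "Name" column? Then what do you do with the remaining columns? Sum them? Take the mean? Take only the first entry?

These are duplicate entries; some have extra spaces or dots. So I want a single row to represent these similar names/words.

OK, so you want to group "Hello World" and "HelloWorld" together, and "Name" is your only column?

Yes, but I have three other columns, and I don't mind if they get collapsed.

It would help if you edited the question to include a data sample and the desired result, so that the solutions given fit your problem.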
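If the variants really differ only by spacing, dots, and letter case, as described in the comments above, exact matching on a normalized key may be enough without any fuzzy scoring. The sketch below is an illustration under that assumption. The sample data and the column `A` are hypothetical, and the pattern `[^a-z0-9]` also strips non-ASCII letters, which may not suit every dataset.

```python
import pandas as pd

# Hypothetical sample data for illustration only
data = pd.DataFrame({
    "Name": ["Hello World", "HelloWorld", "hello world.", "Foo Bar", " Foo  Bar "],
    "A": [1, 2, 3, 4, 5],
})

# Comparison key: lowercase, then drop everything except ASCII letters and digits
key = data["Name"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)

# Keep only the first row for each distinct key, collapsing the other columns
collapsed = data[~key.duplicated()].reset_index(drop=True)
print(collapsed)
```

This keeps the first spelling and the first row's values for each group. If the other columns should be summed or averaged instead, grouping on the key and aggregating would be the natural variation.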