Grouping strings by value in Python (python, grouping, levenshtein-distance, fuzzywuzzy)


I am working on Twitter hashtags and have counted how many times each one appears. The counts are stored in a CSV file that looks like this:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

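Now I would like to use the fuzzywuzzy library to group terms that are close to each other, such as "GilletsJaunes" and "gilletsjaune". If the similarity between two terms is greater than 80, the value of one term should be added to the other, and the first term removed. This would give: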

GilletsJaunes, 120
Macron, 50
tax, 10

Using fuzzywuzzy:

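This solves your problem. You can also reduce the input by lowercasing the tags first. fuzz.ratio compares the raw strings, so it is case-sensitive: "HeLlO", "hello" and "HELLO" represent the same word but will not necessarily score above 80 against each other. In your example, "GilletsJaunes" and "gilletsjaune" score exactly 80 as written, which fails a strict "greater than 80" test, and 96 once both are lowercased.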

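import csv
from fuzzywuzzy import fuzz

data = dict()    # tag -> count, in file order
output = dict()  # surviving tag -> merged count
merged = set()   # tags already folded into a group

with open('file.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile, skipinitialspace=True)
    for row in csvReader:
        data[row[0]] = int(row[1])  # csv yields strings, so convert the count

for tag in data:
    if tag in merged:
        continue  # already counted under an earlier tag
    output[tag] = 0
    for key in data:
        # compare lowercased copies: fuzz.ratio is case-sensitive
        if key not in merged and fuzz.ratio(tag.lower(), key.lower()) > 80:
            output[tag] += data[key]
            merged.add(key)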
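To see why lowercasing matters for the example tags, here is a small sketch that uses only the standard library. The `ratio` helper is an assumption, not fuzzywuzzy itself: it mimics `fuzz.ratio` with `difflib.SequenceMatcher`, which fuzzywuzzy falls back to when python-Levenshtein is not installed.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale; a stdlib stand-in for fuzz.ratio (assumption)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Compare the two example tags as written, then lowercased.
raw = ratio("GilletsJaunes", "gilletsjaune")
lowered = ratio("GilletsJaunes".lower(), "gilletsjaune".lower())

print(raw, lowered)  # 80 96
```

With a strict `> 80` test, the raw score of 80 would leave the two tags unmerged, while the lowercased score of 96 passes comfortably.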

First, copy these two helper functions to compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
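As a quick illustration of what the two helpers return, here is a short usage sketch. The helpers are repeated so the snippet runs on its own, and the sample values are made up for the example.

```python
def argmax(pairs):
    # key of the pair with the greatest value
    return max(pairs, key=lambda x: x[1])[0]


def argmax_index(values):
    # index of the greatest value
    return argmax(enumerate(values))


best_tag = argmax([("Macron", 50), ("GilletsJaunes", 100), ("tax", 10)])
best_pos = argmax_index([12, 80, 35])

print(best_tag, best_pos)  # GilletsJaunes 1
```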

Second, load the contents of your CSV into a Python dictionary and proceed as follows:

from fuzzywuzzy import fuzz

input = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50

output = dict()
for query in input:
    references = list(output.keys()) # important: this is output.keys(), not input.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        best_reference = references[argmax_index(scores)]
        output[best_reference] += input[query]
    else:
        output[query] = input[query]

print(output)

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
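The step of loading the CSV into a dictionary is not shown above, so here is a minimal sketch of one way to do it with the standard `csv` module. The function names `load_counts` and `dump_counts` are hypothetical, and an in-memory `io.StringIO` stands in for the real file so the snippet runs as is.

```python
import csv
import io


def load_counts(lines):
    """Read 'tag, count' rows into a dict, summing repeated tags."""
    counts = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        tag, value = row[0].strip(), int(row[1])
        counts[tag] = counts.get(tag, 0) + value
    return counts


def dump_counts(counts, handle):
    """Write the tag -> count mapping back out as CSV rows."""
    writer = csv.writer(handle)
    for tag, value in counts.items():
        writer.writerow([tag, value])


# Stand-in for open('file.csv', newline='') with the data from the question.
sample = io.StringIO("GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n")
counts = load_counts(sample)
print(counts)
```

The resulting dictionary can be used in place of the hard-coded `input` mapping, and `dump_counts(output, handle)` writes the merged result to any open file handle.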

What have you tried so far? Please show your attempt so that we can help you correct it.
