Is there a way to get a similarity score between lists of words in Python?

python numpy math

I want to compute the similarity between lists of words, for example:

import math
from collections import Counter

test = ['address','ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee', 'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address','city']

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

counter1 = Counter(test)
counter2 = Counter(list_a)
counter3 = Counter(list_b)
score = counter_cosine_similarity(counter1,counter2)
print(score) # output : 0.4472135954999579
score = counter_cosine_similarity(counter1,counter3)
print(score) # output : 0.4999999999999999

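For me, these are not quite the scores I want; they should be the other way around. list_a contains both address and ip, so it is a 100% match for test. I understand that cosine similarity compares test and list_a in both directions, so the score is low because list_a contains elements that are not in test. What I want is to compare test against list_a in one direction only, not both.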
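To make the effect concrete, the two scores printed above can be checked by hand. Every word occurs once in each list, so all counts are 1; the only thing that differs is the length of the second vector (10 distinct terms in list_a, 2 in list_b):

```latex
\[
\cos(\text{test}, \text{list\_a})
  = \frac{1\cdot 1 + 1\cdot 1}{\sqrt{2}\,\sqrt{10}}
  = \frac{2}{\sqrt{20}} \approx 0.447,
\qquad
\cos(\text{test}, \text{list\_b})
  = \frac{1\cdot 1}{\sqrt{2}\,\sqrt{2}}
  = 0.5
\]
```

The larger norm of list_a in the denominator is what pulls its score below that of list_b, even though list_a shares more terms with test.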

Desired output:

score = counter_cosine_similarity(counter1,counter2)
print(score) # desired: higher than the list_b score, maybe 1.0
score = counter_cosine_similarity(counter1,counter3)
print(score) # desired: lower than the list_a score, maybe 0.5

If you want a value that gets higher the more terms the two lists share, use the following code:

 score = len(set(test).intersection(set(list_x)))

This tells you how many terms the two lists have in common. If you want duplicates to raise the score, try:

 commonTerms = set(test).intersection(set(list_x))
 counter = Counter(list_x)
 score = sum((counter.get(term) for term in commonTerms)) #edited

If you need the score scaled to [0..1], I would need to know more about your data set.
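One possible way to get a value in [0, 1] is sketched below. It assumes that "similarity" here means the fraction of the distinct terms in test that also appear in the other list; that is a guess at the intent, not something the question confirms. The function name containment_score is made up for this sketch.

```python
def containment_score(query, candidate):
    """Fraction of distinct query terms that also occur in candidate (0.0 to 1.0)."""
    query_terms = set(query)
    if not query_terms:
        # Assumption: an empty query is treated as matching nothing.
        return 0.0
    # One-directional: extra terms in candidate do not lower the score.
    return len(query_terms & set(candidate)) / len(query_terms)


test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee',
          'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

print(containment_score(test, list_a))  # 1.0
print(containment_score(test, list_b))  # 0.5
```

Unlike cosine similarity, this measure is not symmetric: containment_score(list_a, test) would be 0.2, because only 2 of the 10 terms in list_a occur in test. It also ignores how often a term is repeated, so it would need to be adapted if repeated terms should count for more.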

Where is the source code for counter?

@Aaron Digulla I just edited, you can check the code now.

Why do you use Counter? It only tells you how often each word occurs in the list. So in your case, every term has the value 1. How does that help determine the "distance"?

@Aaron Digulla Sometimes I want to use test = ['address', 'address'] to compare against list_a and list_b. For me, address is about location, so when I compare test with list_b the score must be higher. What do I need to do in this case to get the right result for list_b, meaning that address has a high frequency in list_b?

This is not a question about python or numpy. You need to define "similarity" for your current purpose in a strict mathematical sense. Then you can implement it in python or any other programming language.