Python spaCy: incorrect token.vector computation


Code:

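# nlp is assumed to be an already loaded spaCy English pipeline; the question does not say which model.
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print(doc[0], doc[2], doc[6], doc[8])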
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))
Result:

Apples oranges Boots hippos
0.0
0.0

A higher return value is supposed to mean a higher similarity, yet the similarity between apples and oranges comes out as 0. Why?

Edit: The code below shows why the similarity is wrong. It is caused by an incorrect vector computation:

doc = nlp(u'apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Despite the name, this returns the cosine similarity of a and b.
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i]*b[i]
        sa += a[i]*a[i]
        sb += b[i]*b[i]
    sa = sa**0.5
    sb = sb**0.5
    # For an all-zero vector this is 0/0, which evaluates to nan for numpy floats.
    return ans/(sa*sb)

print(doc[0], doc[2], doc[4], doc[8])

print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector),
      dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))

print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]),
      doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))
Output:

apples apple orange oranges
0.750411317806 0.51238496547 nan nan   # results of the hand-written cosine similarity
0.750411349583 0.512384940626 0.0 0.0  # token.similarity()
doc[8].vector is all zeros. So why is the vector for the token "oranges" computed as all zeros?

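The vectors for "orange" and "apple" are computed correctly. What's more, the vector for "apples" is computed correctly too. So why is "oranges" a problem?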
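The nan versus 0.0 difference in the output above comes from how a zero-length vector is handled. The sketch below is a plain-Python cosine similarity that returns 0.0 when either vector has zero norm, which matches the 0.0 values printed by token.similarity(). The function name `cosine_similarity` and the zero-norm guard are illustrative assumptions, not a copy of spaCy's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: report 0.0 for an all-zero vector instead of dividing 0 by 0.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors, close to 1
print(cosine_similarity([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # zero vector, guarded
```

Without the guard, plain Python floats would raise ZeroDivisionError here, while numpy floats (as in the question) produce nan.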

Because the word vectors for two of the tokens ("oranges" and "hippos") are all zeros (this is a problem with the model).

You can check this by printing the vectors of these tokens:

print(oranges.vector)
print(hippos.vector)
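Printing whole vectors gets noisy for longer texts. As a rough sketch, the same check can be run over every token at once using spaCy's `Token.has_vector` and `Token.vector_norm` attributes. The helper name `zero_vector_tokens` is hypothetical, and it only assumes that each item in `doc` exposes `text`, `has_vector` and `vector_norm`.

```python
def zero_vector_tokens(doc):
    """Return the text of every token whose word vector is missing or all zeros."""
    flagged = []
    for token in doc:
        # has_vector is False when the model has no vector for the token;
        # vector_norm is the L2 norm, so 0 means the vector is all zeros.
        if not token.has_vector or token.vector_norm == 0:
            flagged.append(token.text)
    return flagged
```

Called as `zero_vector_tokens(doc)` on the `doc` from the question, it should list "oranges" and "hippos" if their vectors really are all zeros, as reported above.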

Yes, I figured as much. I added an edit to the GitHub issue but forgot to update it here. Anyway, thanks. Everything is fine now.