Is there a simple way to tell spaCy to ignore stop words when using the .similarity method?


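Right now I have a very simple program that takes a sentence, finds the most semantically similar sentence in a given book, and prints that sentence along with the next few sentences.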

import spacy
nlp = spacy.load('en_core_web_lg')

#load alice in wonderland
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

# Python 3 strings are already Unicode, so no unicode() call is needed
mysent = nlp("example sentence, could be whatever")

best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

# print the best match together with the sentences that follow it
start = sentences.index(best_match)
print(sentences[start:start + 10])

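I would like to get better results by telling spaCy to ignore stop words during this process, but I don't know the best way to go about it. For example, I could create a new empty list and append every word that isn't a stop word to it: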

newlist = []
for sentence in sentences:
    for word in sentence:
        # is_stop is a boolean, so test it directly instead of comparing it to the string 'False'
        if not word.is_stop:
            newlist.append(word)

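But I would have to make it more complicated than the code above, because I need to keep the original list of sentences intact (the indexes have to stay the same if I want to print the full sentences later). Also, if I did it this way, I would have to run this new list of lists back through spaCy in order to use the .similarity method.

I feel like there must be a better way to approach this, and I'd really appreciate any guidance. Even if there is no better way than appending each non-stop word to a new list, I'd appreciate help creating a list of lists whose indexes match the original "sentences" variable.

Thanks a lot!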

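What you need to do is override the way spaCy computes similarity.

For the similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute), and then takes the cosine similarity of the two doc vectors: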

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
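
To make the two steps concrete, here is a minimal pure-Python sketch of "average the token vectors, then take the cosine". The 3-dimensional vectors are made up, and the helper names mean_vector and cosine_similarity are illustrative only, not spaCy functions.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# made-up "token vectors" for two tiny documents
doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

similarity = cosine_similarity(mean_vector(doc_a), mean_vector(doc_b))
print(round(similarity, 6))
```

Both averaged vectors point in the same direction here, so the score comes out at (numerically) 1 even though the individual token vectors differ.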

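You have to tweak this slightly so that the vectors of stop words are not taken into account.

The following code should work for you: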

import spacy
from spacy.lang.en import STOP_WORDS
import numpy as np
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # en_core_web_lg word vectors have 300 dimensions
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    # sum the vectors of the non-stop-word tokens
    # (STOP_WORDS entries are lowercase, so compare the lowercased text)
    for token in doc1:
        if token.text.lower() not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text.lower() not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    # cosine similarity of the two averaged vectors
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))

Hope it helps.
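
The question also asked how to keep the indexes aligned with the original list of sentences. One way to sketch this is to never build a filtered copy at all: compute a stop-word-free vector per sentence on the fly and remember only the index of the best match. The example below uses a hypothetical toy stop-word set and made-up 2-dimensional embeddings in place of spaCy tokens and token.vector, so it illustrates the bookkeeping rather than the real pipeline.

```python
import math

# Hypothetical toy data: a real program would use spaCy tokens and token.vector
STOP = {"the", "a", "is", "was"}
EMBEDDINGS = {
    "cat": [1.0, 0.0], "kitten": [0.9, 0.1],
    "car": [0.0, 1.0], "truck": [0.1, 0.9],
}

def content_vector(sentence):
    """Average the vectors of the non-stop words; None if nothing is left."""
    vectors = [EMBEDDINGS[word] for word in sentence.lower().split()
               if word not in STOP and word in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms

sentences = ["the car was a truck", "the kitten is a cat", "a truck is a car"]
query = content_vector("a cat")

best_index, best_score = None, -1.0
for i, sentence in enumerate(sentences):
    vec = content_vector(sentence)
    if vec is None:
        continue  # sentence consisted only of stop words or unknown words
    score = cosine(query, vec)
    if score > best_score:
        best_index, best_score = i, score

# best_index refers to the untouched original list, so slicing still works
print(sentences[best_index:best_index + 2])
```

Because only an index is stored, the original sentences stay intact and nothing needs to be run through the pipeline a second time.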

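Here is a slightly more elegant solution: we override how spaCy computes document vectors under the hood, which propagates this customization to any downstream pipeline components, such as the TextCategorizer.

This is based on the following documentation:

This solution is designed around loading pretrained embeddings. Rather than referencing a stop-word list directly, I'm going to assume that, for the embeddings I've loaded, anything outside their vocabulary is a token I want to ignore when computing the document vector.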

import numpy as np

class FancyDocumentVectors(object):
    def __call__(self, doc):
        # register the custom vector function on every doc that passes through the pipeline
        doc.user_hooks["vector"] = self.vector
        return doc

    def vector(self, doc):
        """
        Constrain attention to non-zero vectors.
        Returns concatenation of mean and max pooling
        """
        # This is the part where we filter out stop words
        # (really any token for which we couldn't calculate a vector representation).
        # If you'd rather invoke a stopword list, change the line below to something like:
        # doc_vecs = np.array([t.vector for t in doc if t.text.lower() not in STOP_WORDS])
        doc_vecs = np.array([t.vector for t in doc if t.has_vector])
        if sum(doc_vecs.shape) == 0:
            # no token has a vector: fall back to the first token's (zero) vector
            doc_vecs = np.array([doc[0].vector])

        mean_pooled = doc_vecs.mean(axis=0)

        # Because I'm fancy, I'm going to augment my custom document vector with
        # some additional information. For a demonstration of the value of this
        # approach, reference the SWEM paper: https://arxiv.org/abs/1805.09843
        max_pooled = doc_vecs.max(axis=0)
        doc_vec = np.hstack([mean_pooled, max_pooled])

        # If you're not into it, just return mean_pooled instead.
        return doc_vec

# spaCy v2 style call; spaCy v3 expects the name of a registered component here
nlp.add_pipe(FancyDocumentVectors())
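
As a rough illustration of what the vector method computes, here is a pure-Python sketch of mean and max pooling followed by concatenation. The token vectors are made up, and treating an all-zero row as "no vector" is only a stand-in for spaCy's has_vector check.

```python
def mean_max_pool(vectors):
    """Concatenate the element-wise mean and element-wise max of the vectors."""
    columns = list(zip(*vectors))
    mean_pooled = [sum(column) / len(column) for column in columns]
    max_pooled = [max(column) for column in columns]
    return mean_pooled + max_pooled

# made-up token vectors; the all-zero rows stand in for tokens without a vector
token_vectors = [[0.0, 0.0], [2.0, -1.0], [0.0, 0.0], [4.0, -3.0]]
with_vector = [v for v in token_vectors if any(v)]

pooled = mean_max_pool(with_vector)
print(pooled)       # [3.0, -2.0, 4.0, -1.0]
print(len(pooled))  # 4, twice the embedding width
```

The result is twice as wide as a single token vector, so any two documents being compared need to be built with the same pooling for their vectors to line up.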

Here is a concrete example using vectors trained on Stack Overflow.

First, we load the pretrained embeddings into a blank language model:

import spacy
from gensim.models.keyedvectors import KeyedVectors

# https://github.com/vefstathiou/SO_word2vec
word_vect = KeyedVectors.load_word2vec_format("SO_vectors_200.bin", binary=True)
nlp = spacy.blank('en')
# .syn0 and .index2word are the gensim < 4.0 names; gensim 4 calls them .vectors and .index_to_key
nlp.vocab.vectors = spacy.vocab.Vectors(data=word_vect.syn0, keys=word_vect.index2word)

Default behavior before changing anything:

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 4.353660220883036 -6.901098

Modified behavior after overriding the document vector computation:

# MAGIC!
nlp.add_pipe(FancyDocumentVectors())

doc = nlp("This is a question about spacy.")
for token in doc:
  print(token, token.vector_norm, token.vector.sum())
print(doc.vector_norm, doc.vector.sum())

# This 0.0 0.0
# is 0.0 0.0
# a 0.0 0.0
# question 25.44337 -41.958717
# about 0.0 0.0
# spacy 13.833485 -6.3489656
# . 0.0 0.0
# 24.601780061609414 109.74769

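Thanks a lot. I have a couple of questions: 1) Did you write the code for vector1 and vector2 differently to show that it can be done in different ways (token.text not in STOP_WORDS versus not token.is_stop), or does it need to be done that way? 2) Is 300 an arbitrary choice for np.zeros? Does vector1 = vector1 + token.vector overwrite one of the zeros, or does it just append a number after the 300 zeros?

1) I found some bugs in the stop-word feature when using the model. It wasn't intentional; I just forgot to correct it for the second vector. I've updated my answer to be consistent. 2) It is not arbitrary. It is the size spaCy uses to represent words as vectors: token.vector has size 300.

Ah, got it. So the variable vector1 will end up being an array of lists whose length equals the number of non-stop words in doc1, plus the first list of 300 zeros?

No, you still haven't quite got it. Every token has a vector representation of size 300, and vector1 is the average of those representations (not counting the stop words). So we add up each token's representation and then divide by the number of tokens. vector1 is the vector representation, of size 300, of the whole document (the doc object).

Great answer. What I don't understand is why you divide by the total number of tokens (i.e. len(doc1)) rather than by the number of non-stop-word tokens. IMHO the sum should be weighted by the number of non-stop-word vectors that were added, not by the total number of tokens. Example: "this is a sentence about why I am this" and "this is a baby". The first sentence has 8 stop words while the other has 2, so it gets divided by 4 times as much!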
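
One observation on the divisor question: cosine similarity does not change when either vector is multiplied by a positive constant, so for the similarity score itself the choice between the total token count and the non-stop-word count cancels out. The sketch below checks this with made-up summed vectors and hypothetical token counts. The divisor would still matter if the averaged vector were used for something that depends on magnitude, such as Euclidean distance.

```python
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms

def scaled(vector, divisor):
    return [x / divisor for x in vector]

# made-up sums of the non-stop-word vectors of two documents
sum1 = [3.0, 1.0, 2.0]
sum2 = [1.0, 4.0, 0.5]

# hypothetical counts: doc1 has 10 tokens, doc2 has 4; each has 2 non-stop words
by_total_tokens = cosine(scaled(sum1, 10), scaled(sum2, 4))
by_content_tokens = cosine(scaled(sum1, 2), scaled(sum2, 2))

print(math.isclose(by_total_tokens, by_content_tokens))
```

The two scores agree up to floating-point rounding, which is why the answer's code gives the same similarity with either divisor.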