Why does Gensim's most_similar in doc2vec return the query vector itself in its output?

I am using the following code to get an ordered list of user posts.

model = doc2vec.Doc2Vec.load(doc2vec_model_name)
doc_vectors = model.docvecs.doctag_syn0
doc_tags = model.docvecs.offset2doctag

for w, sim in model.docvecs.most_similar(positive=[model.infer_vector('phone_comments')], topn=4000):
        print(w, sim)
        fw.write(w)
        fw.write(" (")
        fw.write(str(sim))
        fw.write(")")
        fw.write("\n")

fw.close()

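However, I also get the vector 'phone comments' (the one I use to find the nearest neighbours) in 6th place of the list. Is there a mistake in my code, or is this an issue in Gensim (since a vector cannot be a neighbour of itself)?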

EDIT

Doc2vec model training code

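import re
from collections import namedtuple
from gensim.models import doc2vec
from gensim.models.phrases import Phrases

######Preprocessing
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')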
for key, value in my_d.items():
    value = re.sub("[^1-9a-zA-Z]"," ", value)
    words = value.lower().split()
    tags = key.replace(' ', '_')
    docs.append(analyzedDocument(words, tags.split(' ')))

sentences = []  # Initialize an empty list of sentences
######Get n-grams
#Get list of lists of tokenised words. 1 sentence = 1 list
for item in docs:
    sentences.append(item.words)

#identify bigrams and trigrams (trigram_sentences_project) 
trigram_sentences_project = []
bigram = Phrases(sentences, min_count=5, delimiter=b' ')
trigram = Phrases(bigram[sentences], min_count=5, delimiter=b' ')

for sent in sentences:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    trigram_sentences_project.append(trigrams_)

paper_count = 0
for item in trigram_sentences_project:
    docs[paper_count] = docs[paper_count]._replace(words=item)
    paper_count = paper_count+1

# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 5, workers = 4, iter = 20)

#Save the trained model for later use to take the similarity values
model_name = user_defined_doc2vec_model_name
model.save(model_name)

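The infer_vector() method expects a list of tokens, just like the words attribute of the text examples (TaggedDocument objects) that were used to train the model.

You are supplying a plain string, 'phone_comments', which will look to infer_vector() like the list ['p', 'h', 'o', 'n', 'e', '_', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's']. So the origin vector you pass to most_similar() is probably garbage.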
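
To make the point about strings concrete, here is a minimal pure-Python sketch. It does not call Gensim at all; it only shows what any code that iterates over its "words" argument would see. The helper name tokenize and the sample sentence are hypothetical, and the cleaning rule is copied from the question's training loop (including its 1-9 character range).

```python
import re

def tokenize(text):
    """Hypothetical helper: same cleaning as the question's training loop."""
    # Regex copied from the question as-is (note that it uses 1-9, not 0-9).
    cleaned = re.sub("[^1-9a-zA-Z]", " ", text)
    return cleaned.lower().split()

# Iterating over a plain string yields single characters, not words.
as_characters = list("phone_comments")

# Tokenising a new piece of text yields a word list like the training data.
as_tokens = tokenize("Phone comments: battery life is great!")

print(as_characters)
print(as_tokens)
```

If the text handed to infer_vector() goes through the same tokenisation as the training documents, the inferred vector is at least built from the same kind of tokens the model has seen.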

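Further, you are not getting back the input 'phone_comments'; you are getting back the different string 'phone comments'. If that is a tag name in the model, then it must have been a tag supplied during model training. Its superficial similarity to phone_comments may be meaningless; they are different strings.

(But it may also hint that your training had problems too, and trained the text that should have been words=['phone', 'comments'] as words=['p', 'h', 'o', 'n', 'e', ' ', 'c', 'o', 'm', 'm', 'e', 'n', 't', 's'] instead.)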
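
One way to check both suspicions before training is sketched below, again in plain Python. It assumes the documents are namedtuples with words and tags fields, as in the question; the function name looks_character_level and the two sample documents are made up for illustration.

```python
from collections import namedtuple

AnalyzedDocument = namedtuple("AnalyzedDocument", "words tags")

def looks_character_level(doc):
    """True if every 'word' in a non-empty document is a single character."""
    return len(doc.words) > 0 and all(len(w) == 1 for w in doc.words)

# Illustrative data only: one well-formed document, one built from a raw string.
docs = [
    AnalyzedDocument(["phone", "comments"], ["phone_comments"]),
    AnalyzedDocument(list("phone comments"), ["bad_example"]),
]

# Documents whose words were probably split into characters by mistake.
suspicious = [doc.tags for doc in docs if looks_character_level(doc)]

# The exact set of tag strings the model would be trained with.
known_tags = {tag for doc in docs for tag in doc.tags}

print(suspicious)
print("phone comments" in known_tags, "phone_comments" in known_tags)
```

Printing the set of tags makes it obvious whether 'phone comments' and 'phone_comments' are both present as separate tags or whether only one of them was ever supplied.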
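Thanks a lot for the great answer :) I have updated my question with my doc2vec training code. As you mentioned, I am using TaggedDocument objects. Can you please tell me where I went wrong? :) By the way, phone_comment is a tag I used in training (it contains a list of phone-related comments).

What is the current problem? After fixing what you supply to infer_vector() so that it is a list of tokens, do you still get the same results? (You shouldn't.) After making that fix, what, if anything, is wrong with the results?

The infer_vector() method expects a list of tokens, just like the words attribute of the text examples that were used to train the model. So if you supplied texts during training with words=['phone', 'comments'], then to infer a vector for a similar text you would call infer_vector(['phone', 'comments']).

Use the infer_vector() method to create new doc-vectors from new texts. So you should supply it the words of a new document. You do not supply it any existing tag. If you want the vector that was trained for one of the tags you supplied during bulk training, just look it up with indexed access: model.docvecs['phone_comments'].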
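
The two cases described in the comments above can be sketched as two small helpers. The function names nearest_tags and trained_vector are hypothetical; the calls on model (infer_vector, docvecs.most_similar, docvecs[...]) are the ones already used in the question and answer. Newer Gensim releases may expose docvecs under a different attribute name, so treat this as a sketch for the API version shown in the question.

```python
def nearest_tags(model, tokens, topn=10):
    """Infer a vector for NEW text and return its nearest trained tags."""
    # Guard against the mistake from the question: a bare string would be
    # treated as a sequence of single characters.
    if isinstance(tokens, str):
        raise TypeError("pass a list of word tokens, not a single string")
    inferred = model.infer_vector(list(tokens))
    return model.docvecs.most_similar(positive=[inferred], topn=topn)

def trained_vector(model, tag):
    """Look up the vector that was learned for an existing tag in training."""
    return model.docvecs[tag]

# Usage, assuming `model` is a trained Doc2Vec model loaded elsewhere:
#   nearest_tags(model, ['phone', 'comments'], topn=20)
#   trained_vector(model, 'phone_comments')
```

Note that when the query is a vector looked up for an existing tag, it is expected for that same tag to appear among the results, since it is a member of the set being searched.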