Python Gensim Doc2Vec model only generates a limited number of vectors


I am using the gensim Doc2Vec model to generate my feature vectors. Here is the code I am using; I have explained my problem in the code comments:

import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# creating a list of tagged documents
training_docs = []

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
    # 'doc' is in unicode format and I have already preprocessed it
    training_docs.append(TaggedDocument(doc.split(), str(index+1)))

# at this point, I have 53 tagged documents in my 'training_docs' list

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)

# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list.
print(len(model.docvecs))
# output: 10

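I am just wondering whether I have made a mistake, or whether there is some other parameter I need to set.

Update: I experimented with the tags argument of TaggedDocument. When I changed it to a mix of text and digits, such as Doc1, Doc2, ..., I got a different count of generated vectors, but still not the number of feature vectors I expected.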

Look at the actual tags the model discovered in the corpus:

print(model.docvecs.offset2doctag)

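Do you see the pattern?

The tags property of each document should be a list of tags, not a single tag. If you supply a plain string of an integer, it is treated as a list of digits, so the model only learns the tags '0', '1', ..., '9'.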
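To see why exactly ten tags come out, here is a plain-Python sketch that needs no gensim. The helper distinct_tags is hypothetical: it only mimics walking each tags value as a sequence, and is not gensim's actual code.

```python
def distinct_tags(tag_values):
    """Collect every element seen when each tags value is iterated."""
    seen = set()
    for tags in tag_values:
        # Iterating a str yields its characters; iterating a list yields its items.
        for tag in tags:
            seen.add(tag)
    return seen

# Tags as in the question: bare strings "1" .. "53"
string_tags = [str(index + 1) for index in range(53)]
# Tags wrapped in a one-element list: ["1"] .. ["53"]
list_tags = [[str(index + 1)] for index in range(53)]

print(len(distinct_tags(string_tags)))  # 10 (the digits '0' through '9')
print(len(distinct_tags(list_tags)))    # 53
```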

You can replace str(index+1) with [str(index+1)] and get the behavior you expected.

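However, since your document IDs are just ascending integers, you can also use plain Python ints as the doctags. This saves some memory by avoiding the creation of a lookup dict from string tags to int array slots. To do so, replace str(index+1) with [index]. This starts the doc IDs from 0, which is slightly more Pythonic and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.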
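The integer-tag variant could be sketched roughly as follows. The contents of all_docs are placeholder strings, not the asker's data. The namedtuple fallback is an assumption added only so the snippet runs without gensim installed; gensim's own TaggedDocument has the same two fields, words and tags.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only when gensim is missing.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Placeholder corpus: 53 short strings standing in for the real documents.
all_docs = ["document number %d has some words" % i for i in range(53)]

# One tag per document, wrapped in a list; plain ints starting at 0.
training_docs = [
    TaggedDocument(doc.split(), [index])
    for index, doc in enumerate(all_docs)
]

# Every document now contributes exactly one distinct tag.
all_tags = {tag for td in training_docs for tag in td.tags}
print(len(all_tags))  # 53
```

Passing training_docs built this way to the same Doc2Vec(...) call as in the question should then yield one vector per document.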