Python: unable to load a Doc2Vec object using gensim


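I am trying to load a pre-trained Doc2Vec model with gensim and use it to map a paragraph to a vector. The pre-trained model I downloaded is the English Wikipedia DBOW model, which is offered at the same link as the reference I am following. However, when I load the Doc2Vec model trained on Wikipedia and infer vectors with the following code: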

import gensim.models as g
import codecs

model="wiki_sg/word2vec.bin"
test_docs="test_docs.txt"
output_file="test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
test_docs = [x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines()]
m = g.Doc2Vec.load(model)

#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
output.flush()
output.close()

I get an error:

/Users/zhangji/Desktop/CSE547/Project/NLP/venv/lib/python2.7/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "/Users/zhangji/Desktop/CSE547/Project/NLP/AbstractMapping.py", line 19, in <module>
    output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n")
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'

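In addition, after looking at the source code in the gensim package, I found that the Doc2Vec class does not define a load() function of its own. Because it is a subclass of Word2Vec, Doc2Vec.load() calls the inherited load() from Word2Vec, which makes the model m a Word2Vec object. However, infer_vector() is unique to Doc2Vec and does not exist in Word2Vec, which is why the error occurs. I also tried casting the model m to Doc2Vec, but got the following error: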

>>> g.Doc2Vec(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 599, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 513, in build_vocab
    self.scan_vocab(sentences, trim_rule=trim_rule)  # initial survey
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/doc2vec.py", line 635, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/Users/zhangji/Library/Python/2.7/lib/python/site-packages/gensim/models/word2vec.py", line 1367, in __getitem__
    return vstack([self.syn0[self.vocab[word].index] for word in words])
TypeError: 'int' object is not iterable
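The load() behaviour described in the question can be illustrated with a toy sketch. It assumes that the inherited load() rebuilds whatever class the file recorded, in the way pickle-style loading does, and that this is what gensim does here. `ToyWord2Vec`, `ToyDoc2Vec`, `TOY_CLASSES` and `round_trip` are hypothetical names used only for illustration; none of this is gensim code.

```python
import json
import os
import tempfile


class ToyWord2Vec(object):
    """Hypothetical stand-in for a base model class; not gensim code."""

    def save(self, path):
        # Record the name of the concrete class, much as pickle would.
        with open(path, "w") as handle:
            json.dump({"class": type(self).__name__}, handle)

    @classmethod
    def load(cls, path):
        # `cls` (the class load() was called on) is never consulted:
        # the object is rebuilt from the class name stored in the file.
        with open(path, "r") as handle:
            saved_class = json.load(handle)["class"]
        return TOY_CLASSES[saved_class]()


class ToyDoc2Vec(ToyWord2Vec):
    """Hypothetical subclass that adds an inference method."""

    def infer_vector(self, words):
        return [float(len(words))]


TOY_CLASSES = {"ToyWord2Vec": ToyWord2Vec, "ToyDoc2Vec": ToyDoc2Vec}


def round_trip(saved_class, loading_class):
    """Save an instance of saved_class, then load it via loading_class."""
    handle, path = tempfile.mkstemp(suffix=".json")
    os.close(handle)
    try:
        saved_class().save(path)
        return loading_class.load(path)
    finally:
        os.remove(path)


loaded = round_trip(ToyWord2Vec, ToyDoc2Vec)
print(type(loaded).__name__)            # ToyWord2Vec
print(hasattr(loaded, "infer_vector"))  # False
```

Under that assumption, the type of the loaded object depends on what was saved to the file rather than on the class used to call load(), so checking `hasattr(m, "infer_vector")` right after loading shows which kind of model the file actually contains.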

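In fact, all I want for now is to convert a paragraph into a vector with gensim, using a pre-trained model that works well on academic articles. For various reasons, I do not want to train the models myself. I would be very grateful if someone could help me solve this problem.

By the way, I am using Python 2.7, and my current gensim version is 0.12.4.

Thanks.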

I would avoid using that 4-year-old nonstandard gensim fork, or any 4-year-old saved models that load only with such code.

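The Wikipedia DBOW model is also suspiciously small at only 1.4GB. Even 4 years ago Wikipedia had more than 4 million articles, and a 300-dimensional Doc2Vec model trained to hold doc-vectors for 4 million articles would be at least 4,000,000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB in size, not even counting the other parts of the model. (So the download is clearly not the 4.3-million-document, 300-dimensional model mentioned in the associated paper, but something that has been truncated in some other unclear way.)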
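The size estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the lower bound: `min_doc_vector_bytes` is a hypothetical helper, and the 4 bytes per value assumes 32-bit floats, as in the calculation above.

```python
def min_doc_vector_bytes(num_docs, dimensions, bytes_per_value=4):
    """Lower bound on the storage for the doc-vector array alone."""
    return num_docs * dimensions * bytes_per_value


# 4 million articles, 300 dimensions, 4 bytes per dimension.
estimate = min_doc_vector_bytes(4000000, 300)
print(estimate)        # 4800000000
print(estimate / 1e9)  # 4.8 (decimal gigabytes)
```

The 1.4GB download falls well short of this bound, which is what makes it suspicious.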

The current gensim version is 3.8.3, released a few weeks ago.

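Building your own Doc2Vec model with current code and a current Wikipedia dump would likely take a bit of tinkering and an overnight or longer run, but you would then be on modern, supported code, with a modern model that better understands words that have come into use in the last 4 years. (And if you trained a model on a corpus of exactly the kind of documents you care about, such as academic articles, then the vocabulary, the word senses, and the match to your own text preprocessing for later inferred documents would all be better.)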
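One way to get the preprocessing match mentioned above is to send both the training corpus and any later text through the same function. The sketch below shows only that idea: `preprocess` and its tokenization rule (lowercase, alphanumeric runs) are assumptions for illustration, not a recommendation from gensim, and the sample texts are made up.

```python
import re

# Hypothetical tokenization rule: runs of lowercase letters and digits.
_TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return _TOKEN_PATTERN.findall(text.lower())


# Training side: every document goes through preprocess() once.
training_texts = ["Deep learning for NLP.", "Graph-based ranking methods."]
training_tokens = [preprocess(text) for text in training_texts]

# Inference side: new text goes through the very same function, so the
# tokens handed to the model's infer_vector() look like the training tokens.
new_tokens = preprocess("Ranking methods for deep NLP!")
print(new_tokens)  # ['ranking', 'methods', 'for', 'deep', 'nlp']
```

With a pre-trained model from someone else, the original preprocessing is often unknown, which is one reason a self-trained model can give better inferred vectors.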

There is a Jupyter notebook example of building a Doc2Vec model from Wikipedia in the gensim source tree; it is either functional or very close to functional.
