Problem saving pre-trained fastText vectors in "word2vec" format using _save_word2vec_format()


For a list of words, I want to get their fastText vectors and save them to a file in the same "word2vec" .txt format (word + space + vector, as plain text).

This is what I did:

import numpy as np
from gensim.models.fasttext import load_facebook_model

dict = open("word_list.txt","r") #the list of words I have
path = "cc.en.300.bin"
model = load_facebook_model(path)

vectors = []
words = []
for word in dict:
    vectors.append(model[word])
    words.append(word)

vectors_array = np.array(vectors)

I want to save the list words and the ndarray vectors_array in the original .txt format.

I tried using the function _save_word2vec_format from gensim:

But I got an error:

INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
trials.py:42: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  vectors.append(model[word])
INFO:__main__:storing 8664x300 projection weights into arrays_to_txt_oct3.txt
loading the model for: en
finish loading the model for: en
len(vectors): 8664
len(words):  8664
shape of vectors_array (8664, 300)
mission launched!
Traceback (most recent call last):
  File "trials.py", line 102, in <module>
    _save_word2vec_format(YOUR_VEC_FILE_PATH, words, vectors_array, fvocab=None, binary=False, total_vec=None)
  File "trials.py", line 89, in _save_word2vec_format
    for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
  File "/cs/snapless/oabend/tailin/transdiv/lib/python3.7/site-packages/six.py", line 589, in iteritems
    return iter(d.items(**kw))
AttributeError: 'list' object has no attribute 'items'

But I still get an error message:

Traceback (most recent call last):
  File "trials.py", line 99, in <module>
    _save_word2vec_format(YOUR_VEC_FILE_PATH, dict, vectors_array, fvocab=None, binary=False, total_vec=None)
  File "trials.py", line 77, in _save_word2vec_format
    total_vec = len(vocab)
TypeError: object of type '_io.TextIOWrapper' has no len()


I don't know how to pass in the list of words in the right format...

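You can directly import and reuse Gensim's KeyedVectors class: assemble your own (sub)set of word vectors as an instance of KeyedVectors, then use its .save_word2vec_format() method.

For example, roughly this should work: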

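from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model

with open("word_list.txt", "r") as words_file:  # open the list of words as a text file
    # read each line into a new list, dropping trailing newlines and blank lines
    words_list = [line.strip() for line in words_file if line.strip()]

fasttext_path = "cc.en.300.bin"
model = load_facebook_model(fasttext_path)

kv = KeyedVectors(vector_size=model.wv.vector_size)  # new empty KV object

vectors = []
for word in words_list:
    vectors.append(model.wv[word])  # vectors for words_list, in the same order

# bulk-add these keys (words) and vectors; in Gensim 4.x the corresponding method is add_vectors()
kv.add(words_list, vectors)

kv.save_word2vec_format('my_kv.vec', binary=False)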
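For reference, the word2vec text layout the question asks for is simple: a header line with the number of vectors and their dimensionality, then one line per word holding the word followed by its space-separated vector components. The following is a minimal sketch of writing that layout by hand with only the standard library. It is not Gensim's implementation; the function name write_word2vec_text and the toy words and values are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def write_word2vec_text(path, words, vectors):
    """Write words and vectors in the plain-text word2vec layout."""
    if len(words) != len(vectors):
        raise ValueError("words and vectors must have the same length")
    dim = len(vectors[0]) if len(vectors) > 0 else 0
    with open(path, "w", encoding="utf-8") as out:
        # Header line: "<number of vectors> <dimensionality>"
        out.write(f"{len(words)} {dim}\n")
        for word, vector in zip(words, vectors):
            if len(vector) != dim:
                raise ValueError(f"vector for {word!r} has the wrong size")
            values = " ".join(repr(float(v)) for v in vector)
            # Body line: the word, a space, then the vector components
            out.write(f"{word} {values}\n")


# Hypothetical toy data standing in for real fastText vectors.
sample_words = ["cat", "dog"]
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

sample_path = os.path.join(tempfile.mkdtemp(), "sample.vec")
write_word2vec_text(sample_path, sample_words, sample_vectors)

# Read the file back so the layout can be inspected.
with open(sample_path, encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()
```

In practice the KeyedVectors route is usually preferable, since it also handles the binary variant and keeps the file loadable with Gensim's own readers.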

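First, avoid naming your own variable dict: that is the name of a core Python type/function. (You also want to avoid calling something a dict when it is at best a simple list of words, and in your case is probably a sequence of lines being read from a file.) Also, your question does not show the line of code that calls _save_word2vec_format(), so it is unclear what arguments you supplied. (The error suggests you supplied a list where it expects something more dict-like.) Further, you generally want to avoid cutting and pasting source code such as _save_word2vec_format() instead of importing it (or other classes/functions of the Gensim API) for reuse in your own code, especially when you are not changing that code. In fact, you should probably use the utility class KeyedVectors to assemble the set of vectors and then save them. I'll show that in an answer shortly.

Thank you so much for your valuable comments!!! I learned a lot, and I will definitely take everything you said into account in my future code :) Really. Thanks a lot.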
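The naming and file-handling points raised in the comments can be sketched as follows. Iterating over an open file yields each line with its trailing newline still attached, and the file object itself has no length, so it helps to read the words into a real list first. The helper name read_word_list and the temporary demo file are hypothetical and exist only so the sketch runs on its own.

```python
import os
import tempfile


def read_word_list(path):
    """Return the non-empty lines of a text file as a list of words."""
    with open(path, "r", encoding="utf-8") as words_file:
        # strip() removes the trailing newline that each line carries
        return [line.strip() for line in words_file if line.strip()]


# Hypothetical word file, created only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "word_list.txt")
with open(demo_path, "w", encoding="utf-8") as demo_file:
    demo_file.write("cat\ndog\n\nbird\n")

# A descriptive name that does not shadow the built-in dict type.
words_list = read_word_list(demo_path)
```

Unlike the open file object, the resulting list supports len(), which is what the TypeError about '_io.TextIOWrapper' in the second traceback complains about.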

#convert list of words into a dictionary
words_dict = {i:x for i,x in enumerate(words)}
