Python Gensim: word vector encoding problem

After creating word vectors in Gensim 2.2.0 from plain English text files of IMDB movie ratings:

import gensim, logging
import smart_open, os
from nltk.tokenize import RegexpTokenizer

VEC_SIZE = 300 
MIN_COUNT = 5
WORKERS = 4
data_path = './data/'
vectors_path = 'vectors.bin.gz'

class AllSentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
        self.read_err_cnt = 0
        self.tokenizer = RegexpTokenizer('[\'a-zA-Z]+', discard_empty=True)

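    def __iter__(self):
        # Yield one list of tokens per line, for every file in the directory.
        for fname in os.listdir(self.dirname):
            print(fname)
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    try:
                        words = self.tokenizer.tokenize(line)
                    except Exception:
                        # Count lines that could not be tokenized and skip them.
                        self.read_err_cnt += 1
                        continue
                    yield words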

sentences = AllSentences(data_path)

Training and saving the model:

model = gensim.models.Word2Vec(sentences, size=VEC_SIZE, 
                               min_count=MIN_COUNT, workers=WORKERS)
word_vectors = model.wv
word_vectors.save(vectors_path)

Then trying to load it back:

from gensim.models.keyedvectors import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(vectors_path,
                                            binary=True,
                                            unicode_errors='ignore')

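I get a "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0" exception (see below). I tried different values for the encoding parameter, including 'ISO-8859-1' and 'Latin1', as well as both binary=True and binary=False. Nothing helps: the same exception is raised regardless of the parameters. What is wrong? How can I make loading the vectors work?

Exception: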

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-64-f353fa49685c> in <module>()
----> 1 w2v = get_w2v_vectors()

<ipython-input-63-cbbe0a76e837> in get_w2v_vectors()
      3     vectors = KeyedVectors.load_word2vec_format(word_vectors_path,
      4                                                     binary=True,
----> 5                                                     unicode_errors='ignore')
      6 
      7                                                 #unicode_errors='ignore')

D:\usr\anaconda\lib\site-packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    204         logger.info("loading projection weights from %s", fname)
    205         with utils.smart_open(fname) as fin:
--> 206             header = utils.to_unicode(fin.readline(), encoding=encoding)
    207             vocab_size, vector_size = map(int, header.split())  # throws for invalid file format
    208             if limit:

D:\usr\anaconda\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors)
    233     if isinstance(text, unicode):
    234         return text
--> 235     return unicode(text, encoding, errors=errors)
    236 to_unicode = any2unicode
    237 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

If you save the vectors with gensim's native save() method, you should load them with the native load() method.
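
For context on the error itself, here is a small standard-library sketch. It assumes that gensim's native save() writes a Python pickle (an assumption about gensim internals, not something stated in the question). A pickle written with protocol 2 or later starts with byte 0x80, which is not a valid first byte of UTF-8 text, so decoding the start of such a file as a text header fails in the same way as the traceback above. The names payload, first_byte and decode_error are purely illustrative.

```python
import pickle

# Pickle a small stand-in object with protocol 2; the stream begins with
# the PROTO opcode, which is the single byte 0x80.
payload = pickle.dumps({"word": [0.1, 0.2]}, protocol=2)
first_byte = payload[0]

# Decoding the pickle as UTF-8 text fails on that very first byte.
try:
    payload.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

print(hex(first_byte))
print(decode_error)
```

If that assumption holds, changing the encoding or binary arguments cannot help, because the file is simply not in word2vec format.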

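If you want to load the vectors with load_word2vec_format(), you need to save them with save_word2vec_format(). (You lose some information this way, such as the exact occurrence counts that would otherwise be in the KeyedVectors.vocab dictionary items.)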