Encoding problem in Python when using w2v

I am writing my first Python application that uses a word2vec model. Here is my simple code:

import gensim, logging
import sys
import warnings
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def main(): 
    ####LOAD MODEL
    model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False)  
    model.similarity('man', 'women')

if __name__ == '__main__':
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        #warnings.simplefilter("ignore")
    main()

I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte

I tried to fix this by adding the following two lines, but I still get the error:

reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8') #UTF8 #latin-1

The w2v model was trained on English sentences.

Edit: here is the full stack trace:

**%run "...\getSimilarity.py"**
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
**...\getSimilarity.py in <module>()**
     64         warnings.simplefilter("error")
     65         #warnings.simplefilter("ignore")
---> 66     main()

**...\getSimilarity.py in main()**
     30     ####LOAD MODEL
---> 31     model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False)  # C binary format
     32     model.similarity('man', 'women')

**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\models\word2vec.pyc in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors)**
   1090             else:
   1091                 for line_no, line in enumerate(fin):
-> 1092                     parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
   1093                     if len(parts) != vector_size + 1:
   1094                         raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))

**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\utils.pyc in any2unicode(text, encoding, errors)**
    215     if isinstance(text, unicode):
    216         return text
--> 217     return unicode(text, encoding, errors=errors)
    218 to_unicode = any2unicode
    219 

**...\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.6.2.3262.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)**
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

**UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte**

Any hints on how to solve this problem?

Thanks in advance.

I found the solution by reading this page.

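"The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered."

The fix is on your side, and it is to either:

a) Store your model using a program that understands unicode and utf8 (such as gensim). Some C and Java word2vec tools are known to truncate strings at byte boundaries, which can cut a multi-byte utf8 character in half, making it invalid utf8 and leading to this error.

b) Set the unicode_errors flag when running load_word2vec_format, e.g. load_word2vec_format(..., unicode_errors='ignore'). Note that this silences the error, but the utf8 problem is still there; invalid utf8 characters are simply ignored in this case.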
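To illustrate points a) and b), here is a minimal sketch using only the Python 3 standard library, not gensim itself. The sample word, the `decode_with` helper, and the fake vector line are assumptions made up for illustration. It cuts a two-byte utf8 character in half, as a byte-oriented tool might, and shows what the `strict`, `ignore`, and `replace` error modes do with the result.

```python
# "é" is stored as two bytes in utf8 (0xC3 0xA9).
word = "café"
encoded = word.encode("utf-8")

# Simulate a tool that truncates at a byte boundary: the last byte is lost,
# leaving a lone lead byte followed by the rest of a vector line.
truncated_line = encoded[:-1] + b" 0.1 0.2 0.3"


def decode_with(raw, errors):
    """Decode raw bytes as utf8 with the given error mode (hypothetical helper)."""
    try:
        return raw.decode("utf-8", errors=errors)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


strict_result = decode_with(truncated_line, "strict")    # decoding fails
ignore_result = decode_with(truncated_line, "ignore")    # bad byte is dropped
replace_result = decode_with(truncated_line, "replace")  # bad byte becomes U+FFFD

print(strict_result)
print(ignore_result)
print(replace_result)
```

With `ignore`, the word silently becomes a different key ("caf" instead of "café"), which is why option b) hides the problem rather than fixing it.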

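Cause:

The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.

-- from the gensim FAQ. You can choose to set unicode_errors to 'ignore' or 'replace', which seems to work in some cases, but not all.

However, if you look at the help for that function, it also says: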

binary is a boolean indicating whether the data is in binary word2vec format

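The error occurs because the word2vec model was saved in binary format rather than as encoded text. In all such cases, simply setting binary=True works.

For example, if you are trying to use Google's pre-trained model, this should work: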

google_model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
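If it is unclear whether a given file is in the text or the binary word2vec format, a rough check can help before choosing the `binary` flag. The sketch below is a heuristic under stated assumptions, not part of gensim: the function name `looks_like_text_word2vec` and the tiny sample files are made up for illustration. It assumes the usual layout, a header line of `vocab_size vector_size` followed by one entry per word, and tests whether the first entry parses as space-separated decimal numbers.

```python
import os
import struct
import tempfile


def looks_like_text_word2vec(path, probe_bytes=65536):
    """Heuristic: True if the first vector entry parses as text floats."""
    with open(path, "rb") as fin:
        header = fin.readline().decode("ascii", errors="replace").split()
        if len(header) != 2:
            return False
        try:
            vector_size = int(header[1])
        except ValueError:
            return False
        first_entry = fin.readline(probe_bytes)
    parts = first_entry.rstrip().split(b" ")
    if len(parts) != vector_size + 1:
        return False
    try:
        for value in parts[1:]:
            float(value.decode("ascii"))
    except (ValueError, UnicodeDecodeError):
        return False
    return True


# Build two tiny sample files (illustrative data only) and probe them.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "sample.txt")
binary_path = os.path.join(tmp_dir, "sample.bin")

with open(text_path, "wb") as fout:
    fout.write(b"1 3\nman 0.1 0.2 0.3\n")
with open(binary_path, "wb") as fout:
    fout.write(b"1 3\nman " + struct.pack("<3f", 0.1, 0.2, 0.3) + b"\n")

text_result = looks_like_text_word2vec(text_path)
binary_result = looks_like_text_word2vec(binary_path)
print(text_result, binary_result)
```

If the check returns False, loading with binary=True is the more likely fit; the heuristic only looks at the first entry, so it cannot prove the rest of the file is well formed.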

Hope this helps.

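How big is models/vec-cbow.txt? Could you include it in the question via a file-sharing site? It does not appear to be utf-8 encoded.

It is 2.25 GB. I don't understand what you mean by "include it in the question via a file-sharing site".

No, that is too big. It would make no sense.

So, what do you suggest? How can I find out its encoding?

You can use an encoding-detection tool; it can guess the correct encoding. But first try to find the encoding in the documentation, a README file, and so on.

But then I get another error: ValueError: invalid vector on line 0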
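For a file too large to share, one way to narrow the problem down locally is to scan it for the lines that fail to decode. The following is a minimal sketch using only the standard library; the function name `find_invalid_utf8_lines` and the in-memory sample data are assumptions for illustration. It reads the stream line by line in binary mode, so it should also be usable on a multi-gigabyte file opened with `open(path, "rb")`.

```python
import io


def find_invalid_utf8_lines(stream, limit=5):
    """Return (line_no, start, end, reason) for the first lines that are not valid utf8."""
    problems = []
    for line_no, raw in enumerate(stream):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((line_no, exc.start, exc.end, exc.reason))
            if len(problems) >= limit:
                break
    return problems


# Illustrative data: the third line holds a truncated two-byte character.
sample = io.BytesIO(b"2 2\nman 0.1 0.2\ncaf\xc3 0.3 0.4\n")
problems = find_invalid_utf8_lines(sample)
for line_no, start, end, reason in problems:
    print("line", line_no, "bytes", start, "to", end, ":", reason)
```

Looking at the raw bytes of a reported line usually shows whether a word was truncated or whether the file uses another encoding altogether.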