Python scikit-learn CountVectorizer UnicodeDecodeError

I have the following code snippet, in which I am trying to list the term frequencies; first_text and second_text are .tex documents:

from sklearn.feature_extraction.text import CountVectorizer
training_documents = (first_text, second_text)  
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary_  # the fitted vocabulary is stored in vocabulary_
When I run the script, I get the following:

File "test.py", line 19, in <module>
    vectorizer.fit_transform(training_documents)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte
How can I fix this?

Thanks.

If you can work out the encoding of your documents (perhaps it is latin-1), you can pass it to CountVectorizer with

vectorizer = CountVectorizer(encoding='latin-1')
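Otherwise, you can simply skip the problematic bytes (they are dropped during decoding, so the affected tokens lose those characters) with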

vectorizer = CountVectorizer(decode_error='ignore')
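
To see what the two options do, it may help to look at the decoding step on its own. The traceback shows the vectorizer calling doc.decode(self.encoding, self.decode_error), so the sketch below reproduces that call with plain Python and no scikit-learn. It is written for Python 3 (the question uses Python 2.7), and the sample bytes are made up for illustration; they are not taken from the .tex files in the question.

```python
# Made-up sample: 0xa2 is the cent sign in latin-1, but it is not a valid
# start byte in UTF-8 (the same byte value reported in the traceback).
raw = b"price: 5\xa2 per page"

# Default behaviour: strict UTF-8 decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 1: decode with the encoding the file was really written in.
# This is what encoding='latin-1' asks the vectorizer to do.
as_latin1 = raw.decode("latin-1")

# Option 2: keep UTF-8 but drop any byte that cannot be decoded.
# This is what decode_error='ignore' asks the vectorizer to do.
ignored = raw.decode("utf-8", "ignore")

print(strict_failed)
print(ascii(as_latin1))  # ascii() keeps the output printable on any terminal
print(ignored)
```

The first option keeps every character, provided the guessed encoding is correct. The second never fails but silently loses the undecodable bytes, so it is best suited to cases where those characters do not matter for the term counts.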