UnicodeDecodeError in a function in Python 3.5.2


python unicode


def getWordFreqs(textPath, stopWordsPath):
    wordFreqs = dict()
    #open the file in read mode and open stop words
    file = open(textPath, 'r')
    stopWords = set(line.strip() for line in open(stopWordsPath))
    #read the text
    text = file.read()
    #exclude punctuation and convert to lower case; exclude numbers as well
    punctuation = set('!"#$%&\()*+,-./:;<=>?@[\\]^_`{|}~')
    text = ''.join(ch.lower() for ch in text if ch not in punctuation)
    text = ''.join(ch for ch in text if not ch.isdigit())
    #read through the words and add to frequency dictionary
    #if it is not a stop word
    for word in text.split():
        if word not in stopWords:
            if word in wordFreqs:
                wordFreqs[word] += 1
            else:
                wordFreqs[word] = 1

I believe one way to solve the problem is to put this code at the top of the file:

import sys
reload(sys)
sys.setdefaultencoding("UTF8")

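This sets the default encoding to UTF-8. Note that reload(sys) and sys.setdefaultencoding exist only in Python 2; in Python 3 reload is not a builtin, so this snippet raises a NameError there.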

Another (better) solution is the standard library module codecs, which is very easy to use:

import codecs
fileObj = codecs.open( "someFile", "r", "utf-8" )

fileObj is a normal file object that can be read from and written to.
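As a minimal sketch of the codecs approach above, the following writes a small UTF-8 file and reads it back twice: once with codecs.open and once with the built-in open and its encoding argument, which Python 3 accepts directly. The temporary file name and its contents are assumptions made up for the illustration.

```python
import codecs
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve caf\u00e9")

# codecs.open decodes the bytes as UTF-8 while reading.
with codecs.open(path, "r", "utf-8") as file_obj:
    via_codecs = file_obj.read()

# In Python 3 the built-in open takes the encoding as a keyword argument.
with open(path, "r", encoding="utf-8") as file_obj:
    via_open = file_obj.read()

print(via_codecs == via_open)
```

Both calls return the same str; in Python 3 the built-in open is usually enough, so codecs.open matters mainly for code that must also run on Python 2.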

Note on method 1

This can be very dangerous when you use third-party code that relies on the default encoding being ASCII. Use it with care.

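In Python 3, open() defaults to the encoding returned by locale.getpreferredencoding(False). That usually isn't ascii, but it can be when running under some framework, as the error message indicates.

Instead, specify the encoding of the file you are trying to read. If the file was created under Windows, the encoding is likely cp1252, especially since the byte \x97 is an EM DASH in that encoding. Try passing encoding='cp1252' to open().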
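The claim about the byte \x97 can be checked directly. This is a small sketch, assuming a made-up byte string that contains 0x97: it prints the encoding open() would use by default, decodes the bytes as cp1252, and shows that decoding the same bytes as UTF-8 fails.

```python
import locale

# Hypothetical raw bytes containing 0x97, as in the error message.
raw = b"word \x97 word"

# The encoding open() falls back to when none is given (platform dependent).
print(locale.getpreferredencoding(False))

# In cp1252, 0x97 maps to U+2014 (EM DASH).
text = raw.decode("cp1252")
has_em_dash = "\u2014" in text

# In UTF-8, 0x97 is a continuation byte and cannot start a character.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(exc.reason)
```

This is consistent with the utf-8 error reported in the comments: the file is not UTF-8, so switching from the default to utf-8 only changes which codec fails.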
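Comments:

Use the {} button in the editor to display code correctly. Please copy and paste the code from the original source, then highlight it and click {} to format it properly.

Python probably tries to decode the file to Unicode when reading it, but it doesn't know which encoding the file uses, so it treats it as ASCII. Maybe try encoding= in open().

I tried this, but now it throws this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 520: invalid start byte

Definitely not the "best way". There is a reason the function doesn't work without the reload(sys) trick.

@MarkTolonen I clarified what I meant.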

file = open(textPath, 'r', encoding='cp1252')
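Putting the suggestion together, here is a hedged sketch of the question's function with the encoding passed explicitly. The name get_word_freqs, the encoding parameter with its cp1252 default, the use of string.punctuation, and the final return statement are all assumptions added for illustration; they are not part of the original code.

```python
import string


def get_word_freqs(text_path, stop_words_path, encoding="cp1252"):
    """Count non-stop words in a text file (hypothetical variant)."""
    # Open both files with an explicit encoding instead of the locale default.
    with open(stop_words_path, "r", encoding=encoding) as f:
        stop_words = set(line.strip() for line in f)
    with open(text_path, "r", encoding=encoding) as f:
        text = f.read()

    # Drop ASCII punctuation and digits, and lower-case everything else.
    # Assumption: string.punctuation replaces the hand-written set.
    punctuation = set(string.punctuation)
    text = "".join(
        ch.lower() for ch in text
        if ch not in punctuation and not ch.isdigit()
    )

    # Count every word that is not a stop word.
    word_freqs = {}
    for word in text.split():
        if word not in stop_words:
            word_freqs[word] = word_freqs.get(word, 0) + 1
    return word_freqs
```

If the file turns out to use a different encoding, only the encoding argument needs to change; the with blocks also make sure both files are closed.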