Python3 UnicodeDecodeError with readlines() method

python python-3.x unicode


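I'm trying to create a Twitter bot that reads lines from a file and posts them. I'm using Python 3 and tweepy, via a virtualenv on shared server space. This is the part of the code that seems to have trouble: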

#!/foo/env/bin/python3

import re
import tweepy, time, sys

argfile = str(sys.argv[1])

filename=open(argfile, 'r')
f=filename.readlines()
filename.close()

This is the error I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)

The error specifically points to f=filename.readlines() as the source of the error. Any idea what might be wrong? Thanks.

I ended up finding a working answer for myself:

filename=open(argfile, 'rb')

This helped me a great deal.
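Opening the file in binary mode avoids the exception only because nothing is decoded at all. As a minimal sketch of what that means in practice (the sample text and the temporary file name are made up for illustration), readlines() then returns bytes objects, and each line still has to be decoded with the correct encoding before it can be used as text:

```python
import os
import tempfile

# Hypothetical sample data: UTF-8 text containing one non-ASCII character.
raw = "caf\u00e9\nline2\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # illustrative file name
    with open(path, "wb") as out:
        out.write(raw)

    # Binary mode performs no decoding, so UnicodeDecodeError cannot occur here.
    with open(path, "rb") as fh:
        byte_lines = fh.readlines()

# Every element is a bytes object, not a str.
# Turning it into text still requires knowing the real encoding.
text_lines = [line.decode("utf-8") for line in byte_lines]
```

If the bot needs str values (for example, to post them as text), the decode step cannot be skipped; binary mode only postpones it.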

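Your default encoding appears to be ASCII, while the input is most likely UTF-8. When the decoder hits a non-ASCII byte in the input, it raises the exception. It is not so much that readlines itself causes the problem; rather, it triggers the read and decode, and the decode is what fails.

It is an easy fix, though: the built-in open in Python 3 lets you supply the known encoding of the input through the encoding argument, replacing the default (ASCII in your case) with any other recognized encoding. Providing it lets you keep reading the data as str (rather than the significantly different raw binary bytes objects) while Python does the work of converting the raw bytes on disk into real text: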

# Using with statement closes the file for us without needing to remember to close
# explicitly, and closes even when exceptions occur
with open(argfile, encoding='utf-8') as inf:
    f = inf.readlines()
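To see the effect of the encoding argument in isolation, here is a small self-contained sketch. It assumes the file really is UTF-8; the sample text and the temporary file name are invented for the demonstration. Forcing encoding='ascii' reproduces the kind of failure from the question, while encoding='utf-8' reads the same file cleanly:

```python
import os
import tempfile

# Hypothetical sample text with non-ASCII characters.
text = "na\u00efve caf\u00e9\nsecond line\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf8_sample.txt")  # illustrative file name
    # newline="" writes the "\n" characters unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write(text)

    # Decoding as ASCII fails on the first byte >= 0x80.
    try:
        with open(path, encoding="ascii") as inf:
            inf.readlines()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Decoding with the encoding the file was written in succeeds.
    with open(path, encoding="utf-8") as inf:
        lines = inf.readlines()
```

The result is a list of ordinary str lines, so the rest of the program does not need to change.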

I think the best answer (in Python 3) is to use the errors= parameter:

with open('evil_unicode.txt', 'r', errors='replace') as f:
    lines = f.readlines()

Proof:

>>> s = b'\xe5abc\nline2\nline3'
>>> with open('evil_unicode.txt','wb') as f:
...     f.write(s)
...
16
>>> with open('evil_unicode.txt', 'r') as f:
...     lines = f.readlines()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: invalid continuation byte
>>> with open('evil_unicode.txt', 'r', errors='replace') as f:
...     lines = f.readlines()
...
>>> lines
['�abc\n', 'line2\n', 'line3']
>>>

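The linked question has two very useful answers; you should try them.

I used encoding='iso-8859-1' and it solved my problem.

@hsinghal: ISO-8859-1 (also known as Latin-1) will always work, but it will often be wrong. The problem is that it can decode any byte from any encoding, but if the original text was not really Latin-1, it decodes to garbage. You need to know the real encoding, not just guess; UTF-8 is mostly self-checking, so it is unlikely to decode binary gibberish, but Latin-1 will happily decode binary gibberish into text gibberish and never raise a whisper of complaint.

@ShadowRanger Thanks for the explanation.

If you are really using Python 3, this changes your program's behavior significantly; opening in binary mode means that you not only get no line-ending translation (admittedly only an issue on Windows), but you also get back bytes objects rather than str (which you must decode manually if you want to work with str). I posted an answer (assuming you know the encoding, which you need to know in order to decode).

I like the simplicity of this solution, but I just tried it in Python 3.6.8 and it failed.

@M.H: It works on UTF-8 data. If the data is not UTF-8, you need to figure out what it is. This works just as well on 3.6.8 as on any other 3.x version (and on Python 2.6+ if you do from io import open to replace the Py2 open with the Py3 version). If you don't know the encoding, you are stuck guessing.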
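The point about Latin-1 never complaining can be checked directly. The sketch below decodes UTF-8 bytes as Latin-1 to show the resulting garbage, shows that strict UTF-8 rejects the same kind of input, and adds a hypothetical helper, sniff_bom (not part of any library), that looks for a byte order mark. As a side note, the byte 0xfe from the question's error message never occurs in valid UTF-8, and 0xFE 0xFF is the UTF-16 big-endian BOM, so a file starting that way might be UTF-16; that remains a guess without seeing the file.

```python
import codecs

utf8_bytes = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Latin-1 maps every byte 0x00-0xFF to a code point, so decoding never fails,
# but UTF-8 input comes out as mojibake.
mojibake = utf8_bytes.decode("latin-1")

# Strict UTF-8 refuses bytes that cannot be valid UTF-8.
junk = b"\xfe\xff\x00a"
try:
    junk.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

# Latin-1 accepts the very same bytes without complaint.
junk_as_latin1 = junk.decode("latin-1")


def sniff_bom(data):
    """Hypothetical helper: return a codec name suggested by a leading BOM, or None."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


guessed = sniff_bom(junk)
decoded = junk.decode(guessed) if guessed else None
```

A BOM is only a hint and many files have none, so this narrows the guess rather than replacing knowledge of the real encoding.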
>>> with open('evil_unicode.txt', 'r', errors='ignore') as f:
...     lines = f.readlines()
...
>>> lines
['abc\n', 'line2\n', 'line3']
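Both 'replace' and 'ignore' silently lose information about the offending byte. For comparison, here is a short sketch that runs the same bytes as in the session above through several standard error handlers using bytes.decode, which accepts the same errors values as open. The 'backslashreplace' handler for decoding assumes Python 3.5 or later.

```python
data = b"\xe5abc\nline2\nline3"  # same bytes as in the session above

# 'replace' substitutes U+FFFD (the replacement character) for each bad byte.
replaced = data.decode("utf-8", errors="replace")

# 'ignore' drops bad bytes entirely.
ignored = data.decode("utf-8", errors="ignore")

# 'backslashreplace' keeps the byte value visible as an escape sequence.
escaped = data.decode("utf-8", errors="backslashreplace")

# 'surrogateescape' smuggles bad bytes through as lone surrogates, so the
# original bytes can be recovered by encoding with the same handler.
smuggled = data.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")
```

Which handler fits depends on whether the damaged bytes matter later; none of them is a substitute for decoding with the correct encoding.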