UnicodeDecodeError when parsing a file with a for loop in Python 3


When I loop over the lines of a file, I get a UnicodeDecodeError:

with open(somefile, 'r') as f:
    for line in f:
        pass  # do something

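This happens when I use Python 3.4. In general, I have some files that contain a few non-UTF-8 characters. I want to parse each file line by line, find the lines where the problem occurs, and get the exact index within the line where the non-UTF-8 bytes appear. I already have code for this. It works under Python 2.7.9, but under Python 3.4 I get a UnicodeDecodeError as soon as the for loop runs.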

Any ideas?

You need to open the file in binary mode and decode it one line at a time. Try this:

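with open('badutf.txt', 'rb') as f:
    # Binary mode yields raw bytes, so iterating over the file never decodes implicitly.
    for i, line in enumerate(f, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            # e.start is the zero-based byte offset of the first bad byte in this line.
            print('Line: {}, Offset: {}, {}'.format(i, e.start, e.reason))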

Here is what I get in Python 3:

Line: 16, Offset: 6, invalid start byte

Sure enough, the bad byte is on line 16 at offset 6 (offsets count from 0).
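The loop above reports only the first bad byte on each line. If a line may contain several, one possible extension is to resume decoding just past each failure. This is a rough sketch, not part of the original answer; the function name find_bad_bytes is made up for illustration.

```python
def find_bad_bytes(path, encoding="utf-8"):
    """Return (line_number, byte_offset, reason) for every undecodable sequence.

    Hypothetical helper: line numbers start at 1, byte offsets at 0.
    """
    problems = []
    with open(path, "rb") as f:  # binary mode: lines arrive as raw bytes
        for lineno, raw in enumerate(f, 1):
            pos = 0
            while pos < len(raw):
                try:
                    raw[pos:].decode(encoding)
                    break  # the rest of the line decodes cleanly
                except UnicodeDecodeError as e:
                    # e.start and e.end are relative to the slice raw[pos:]
                    problems.append((lineno, pos + e.start, e.reason))
                    pos += e.end  # continue after the offending bytes
    return problems

# Example call (assumes a file named 'badutf.txt' exists):
# for lineno, offset, reason in find_bad_bytes('badutf.txt'):
#     print('Line: {}, Offset: {}, {}'.format(lineno, offset, reason))
```

Re-slicing the line after each error keeps the code short. For very long lines with many bad bytes it does repeated work, which should be acceptable for a one-off diagnostic.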

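for ind, line in enumerate(f, 1): print(ind) will give you the line number.

I had almost gotten there, but in the end I passed encoding='utf-8' as an argument to open(), which solved my problem.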
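Note that an explicit encoding='utf-8' only helps when the bytes really are valid UTF-8 and the platform default encoding was the culprit. If the file does contain invalid bytes, open() in text mode will still raise. When you only need to get through the loop, one option is to also pass the errors argument to open(). A minimal sketch, with a made-up function name:

```python
def read_lines_lenient(path):
    """Read a file as UTF-8, substituting U+FFFD for undecodable bytes.

    Hypothetical helper; errors='replace' is a standard option of open().
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```

This trades accuracy for convenience: the replacement character shows where something was wrong, but the original byte values are lost, so it is not a substitute for the binary-mode approach when you need exact offsets.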