UnicodeDecodeError for medieval characters in Python

python unicode encoding utf-8

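I am trying to run an nltk tokenization program on medieval texts. These texts use medieval characters such as yogh (ȝ), thorn (þ), and eth (ð).

When I run the program (pasted below) with the standard Unicode (utf-8) encoding, I get the following error: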

Traceback (most recent call last):
  File "me_scraper_redux2.py", line 11, in <module>
    tokens = nltk.word_tokenize( open( "ME_Corpus_sm/"+file, encoding="utf_8" ).read() )
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

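I have tried other encodings, such as latin1, and these get around the error, but then I do not get accurate results, because those encodings substitute other characters for the medieval ones. I thought Unicode could handle these characters. Am I doing something wrong, or is there another encoding I should be using? The files were originally in utf-8. See my code below: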

import nltk
import os, os.path
import string

from nltk import word_tokenize
from nltk.corpus import stopwords

files = os.listdir("ME_Corpus_sm/")
for file in files:
    # open, parse, and normalize the tokens (words) in the file
    tokens = nltk.word_tokenize( open( "ME_Corpus_sm/"+file, encoding="utf_8" ).read() )
    tokens = [ token.lower() for token in tokens ]
    tokens = [ ''.join( character for character in token if character not in string.punctuation ) for token in tokens ]
    tokens = [ token for token in tokens if token.isalpha() ]
    tokens = [ token for token in tokens if token not in stopwords.words( 'english' ) ]

    # output the 50 most frequent tokens and their counts
    for word, count in nltk.FreqDist( tokens ).most_common( 50 ):
        print(word + "\t" + str( count ))

Your file is not valid UTF-8.

Maybe part of it is UTF-8 and part of it is some other junk? You can try:

open(..., encoding='utf-8', errors='replace')

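This replaces non-UTF-8 sequences with the replacement character U+FFFD (�, usually drawn as a question mark in a diamond) instead of raising an error, which may give you a chance to see where the problem is. In general, if you have a mix of encodings in one file, you're doomed, as they can't be reliably picked apart.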
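To actually see where the problem bytes sit, something along these lines might help. It is only a minimal sketch: `find_invalid_utf8` is a hypothetical helper name, not part of nltk or the standard library, and the sample bytes are made up for illustration rather than taken from the corpus in the question.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, bad_bytes) for each undecodable run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onwards decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the offending bytes
    return problems


# Made-up sample: valid UTF-8 text with one stray 0x80 byte in the middle.
sample = "þe ȝere".encode("utf-8") + b"\x80" + " ð".encode("utf-8")

print(find_invalid_utf8(sample))
# What errors='replace' does with the same bytes:
print(sample.decode("utf-8", errors="replace"))
```

For a real file you would read the raw bytes first, for example with `open(path, "rb").read()`, and pass them to the helper. The reported offsets are byte positions, comparable to the "position 3131" in the traceback.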

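Can you post a very small text excerpt, say one containing a thorn (possibly as a hex dump or base64 encoding of said excerpt)? The error ("invalid start byte 0x80") seems to point to invalid UTF-8, since 0x80 is a 10xxxxxx byte, which ought to be a continuation byte and should never be found at the start of a character. It may be met in text in a single-byte encoding such as ISO-8859-1 (Latin-1), though...

FYI, Unicode is not UTF-8.

Martin, thanks, I'm still wrapping my head around these things! Lserni, thanks for your help. Hopefully bobince's response will let us hold off on checking the binary for now :)

bobince, for whatever reason I hadn't done that. I have done so now and it works well. All the thorns, eths, and yoghs come out fine, so I'm not sure where the problem was. On a cursory look through the text (about 2000 lines long), I can't even see any obvious question marks unrelated to punctuation. Thanks so much!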
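As a rough illustration of the point about 0x80 being a continuation byte, and of why decoding as Latin-1 "works" but gives the wrong characters, here is a small sketch. `utf8_byte_role` is a hypothetical helper written for this example; it only looks at a single byte's bit pattern and does not validate whole sequences.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte by the role its bit pattern can play in UTF-8."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if byte & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx: can only follow a lead byte
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # starts a 2-, 3-, or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1, 0xF5-0xFF never occur in UTF-8


print(utf8_byte_role(0x80))

# Thorn in UTF-8 is a lead byte followed by a continuation byte...
thorn_utf8 = "þ".encode("utf-8")
print(thorn_utf8, [utf8_byte_role(b) for b in thorn_utf8])

# ...whereas in Latin-1 it is a single byte that cannot appear in UTF-8 at all.
thorn_latin1 = "þ".encode("latin-1")
print(thorn_latin1, [utf8_byte_role(b) for b in thorn_latin1])

# Latin-1 maps every byte to some character, so decoding UTF-8 data with it
# never raises an error; it just turns each byte into its own character.
print(thorn_utf8.decode("latin-1"))
```

The last line shows the two-character result you get for a single thorn, which is consistent with the inaccurate results reported for latin1 in the question.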