Finding a corrupted file in a Python corpus

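I'm using Python's NLTK TaggedCorpusReader to create a corpus from text files. However, one of the files is either not UTF-8 encoded or contains unsupported characters. Is there a way to tell which file contains the problem?

Here is my code: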

import nltk
corpus=nltk.corpus.TaggedCorpusReader("filepath", '.*.txt', encoding='utf-8') #I added the encoding when I saw some answer about that, but it doesn't seem to help
words=corpus.words()
for w in words:
    print(w)

My error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

You can identify the file by reading the corpus one file at a time, like this:

import nltk

corpus = nltk.corpus.TaggedCorpusReader("filepath", r'.*\.txt', encoding='utf-8')

for filename in corpus.fileids():
    try:
        # Corpus views may be lazy, so iterate over the words to force decoding.
        for _ in corpus.words(filename):
            pass
    except UnicodeDecodeError:
        print("UnicodeDecodeError in", filename)

(Alternatively, you can print each filename before reading it and not bother catching the error at all.)
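As a cross-check that does not depend on NLTK, the same one-file-at-a-time idea can be sketched with the standard library alone. The function name `find_undecodable_files` and the `*.txt` glob are assumptions made for illustration; the sketch also assumes the corpus files sit directly in one directory.

```python
from pathlib import Path


def find_undecodable_files(root, pattern="*.txt", encoding="utf-8"):
    """Return (path, error) pairs for files under root that fail to decode.

    Hypothetical helper, not part of NLTK.
    """
    problems = []
    for path in sorted(Path(root).glob(pattern)):
        try:
            # Read the raw bytes and decode them strictly.
            path.read_bytes().decode(encoding)
        except UnicodeDecodeError as err:
            problems.append((path, err))
    return problems
```

Calling `find_undecodable_files("filepath")` returns every offending file rather than stopping at the first one, and each error object exposes `start` and `reason`, which tell you where in the file the bad byte is.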

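Once you have found the file, you have to track down the source of the problem. Is your corpus really UTF-8 encoded? Perhaps it uses another 8-bit encoding, such as Latin-1 or something similar. Specifying an 8-bit encoding such as Latin-1 will not raise an error (these formats have no error checking), but you can have Python print some lines and see whether the chosen encoding looks right.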
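To see how the candidate encodings behave on the byte from the error message, here is a minimal sketch. The helper name `decode_with_candidates`, the list of candidates, and the sample bytes are all made up for illustration; only the leading `0xa0` comes from the question.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Map each candidate encoding to the decoded text, or to the error raised."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError as err:
            results[encoding] = err
    return results


# Made-up sample that starts with byte 0xa0, as in the reported error.
sample = b"\xa0The/DT cat/NN"
for encoding, outcome in decode_with_candidates(sample).items():
    print(encoding, repr(outcome))
```

UTF-8 rejects the sample, while Latin-1 and cp1252 both read `0xa0` as a non-breaking space. A successful decode only shows that the bytes are legal in that encoding, so you still have to read the output to judge whether it is the right one.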

If your corpus is almost entirely in English, you can search the file for lines that contain non-ASCII characters and print only those:

import re

testreader = nltk.corpus.TaggedCorpusReader("filepath", r".*\.txt", encoding="Latin-1")

# badfilename is the file identified in the previous step.
for line in testreader.raw(badfilename).splitlines():
    if re.search(r'[\x80-\xFF]', line):
        print(line)

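Can you put your input file somewhere? We can help you check whether there are further encoding problems. Also, are you on Python 3?

@alvas I did some more digging, and the problem is that the file was not encoded in UTF-8. I'm using Python 3.

This is exactly what I needed, thanks! I found the offending file and changed its encoding to UTF-8, which solved the problem. Thank you so much!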
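The asker does not say how the file was converted. One way to do it in Python, assuming you have already worked out the file's real encoding, is sketched below. The name `reencode_to_utf8` and its parameters are hypothetical, and the output goes to a separate path so the original bytes stay untouched.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding):
    """Decode src with source_encoding and write the text to dst as UTF-8.

    Hypothetical helper; source_encoding must be the file's real encoding,
    otherwise the output will contain the wrong characters.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `reencode_to_utf8("bad.txt", "fixed.txt", "latin-1")` would write a UTF-8 copy of a Latin-1 file; both filenames here are placeholders.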