Python: How to get rid of UnicodeDecodeError when reading raw data from PlaintextCorpusReader

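I am creating a corpus from a set of text files in the following way:

    from nltk.corpus import PlaintextCorpusReader

    newcorpus = PlaintextCorpusReader(corpus_root, '.*')

Now I want to access the words in a file like this:

    text_bow = newcorpus.words("file_name.txt")

But I get the following error:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

Several files throw this error. How can I get rid of this UnicodeDecodeError?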

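To get rid of the decode error, do one of the following:

  • Read the corpus files as bytes and do not decode them to unicode.

  • Find out which encoding was used for the files and use it. (The corpus documentation should tell you.) I suspect it is Latin-1.

  • Use Latin-1 regardless of the actual encoding. This gets rid of the exception, but the resulting strings will be wrong if the files were not originally Latin-1.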
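The three options above can be compared on a single byte without NLTK. The following is a minimal sketch, assuming a hypothetical sample file whose first byte is 0x96 as in the error message; `cp1252` is only a guess at the real encoding, chosen because 0x96 is an en dash there, whereas Latin-1 maps it to a control character.

```python
import os
import tempfile

# Hypothetical sample: byte 0x96 followed by ASCII text, mimicking the
# byte named in the error message from the question.
sample = b"\x96 some text"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file_name.txt")
    with open(path, "wb") as f:
        f.write(sample)

    # Option 1: read raw bytes. Nothing is decoded, so nothing can fail.
    with open(path, "rb") as f:
        raw_bytes = f.read()

    # Decoding those bytes as UTF-8 reproduces the failure.
    try:
        raw_bytes.decode("utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Option 2: decode with the encoding the file was written in.
    # cp1252 is an assumption here; in it, 0x96 is U+2013 (en dash).
    with open(path, encoding="cp1252") as f:
        as_cp1252 = f.read()

    # Option 3: Latin-1 assigns a code point to every byte, so it never
    # raises, but 0x96 becomes the control character U+0096.
    with open(path, encoding="latin-1") as f:
        as_latin1 = f.read()
```

If the text decoded as Latin-1 shows invisible or odd characters where dashes or curly quotes should be, that is a hint the files came from a Windows code page rather than from Latin-1.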


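  • First, find out which encoding your files use. You can try to detect it, or ask the source of your data.

    Then pass it through the encoding= parameter of PlaintextCorpusReader, e.g. for latin-1:

    newcorpus = PlaintextCorpusReader(corpus_root, '.*', encoding='latin-1')

    From the code: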

    class PlaintextCorpusReader(CorpusReader):
        """
        Reader for corpora that consist of plaintext documents.  Paragraphs
        are assumed to be split using blank lines.  Sentences and words can
        be tokenized using the default tokenizers, or by custom tokenizers
        specificed as parameters to the constructor.
        This corpus reader can be customized (e.g., to skip preface
        sections of specific document formats) by creating a subclass and
        overriding the ``CorpusView`` class variable.
        """

        CorpusView = StreamBackedCorpusView
        """The corpus view class used by this reader.  Subclasses of
           ``PlaintextCorpusReader`` may specify alternative corpus view
           classes (e.g., to skip the preface sections of documents.)"""

        def __init__(self, root, fileids,
                     word_tokenizer=WordPunctTokenizer(),
                     sent_tokenizer=nltk.data.LazyLoader(
                         'tokenizers/punkt/english.pickle'),
                     para_block_reader=read_blankline_block,
                     encoding='utf8'):
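
One rough way to "find out the encoding" when nobody can tell you is to try a short list of candidates in strict mode and keep the first that decodes the whole file. The sketch below is not part of NLTK: `guess_encoding`, the candidate list and the sample files are hypothetical names made up for illustration. A successful decode does not prove the guess is right, and Latin-1 must come last because it accepts any byte sequence.

```python
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper: it only shows which codecs do not raise,
    not which one the author of the file actually used.
    """
    with open(path, "rb") as f:
        data = f.read()
    for name in candidates:
        try:
            data.decode(name)  # strict decoding; raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up corpus directory with three small files in different encodings.
with tempfile.TemporaryDirectory() as corpus_root:
    samples = {
        "a.txt": "caf\u00e9".encode("utf-8"),  # valid UTF-8
        "b.txt": b"\x96 dash",                 # invalid UTF-8, valid cp1252
        "c.txt": b"\x81 odd",                  # 0x81 is undefined in cp1252
    }
    for fname, payload in samples.items():
        with open(os.path.join(corpus_root, fname), "wb") as f:
            f.write(payload)

    guesses = {
        fname: guess_encoding(os.path.join(corpus_root, fname))
        for fname in sorted(samples)
    }
```

If every file in the real corpus yields the same guess, that name is a reasonable value to try for the encoding= parameter shown above; the decoded text should still be checked by eye.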