UnicodeDecodeError in Python 2.7

python unicode python-2.7

I am trying to read a UTF-8 encoded XML file in Python and process the lines read from that file as follows:

next_sent_separator_index =  doc_content.find(word_value, int(characterOffsetEnd_value) + 1)

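Here doc_content is a line read from the file, and word_value is one of the strings from that same line. Whenever doc_content or word_value contains non-ASCII characters, I get an encoding error on the line above. So I first tried decoding both with UTF-8 (instead of the default ASCII codec), like this: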

next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)

But I still get the following error:

Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index =  doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)

Can anyone suggest a suitable approach for avoiding this kind of encoding error in Python 2.7?

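You already have Unicode, not a UTF-8 byte string, so you cannot decode it any further. In Python 2, calling .decode('utf-8') on a unicode object first encodes it implicitly with the ASCII codec, which is why the traceback ends in a UnicodeEncodeError rather than a decode error. (You may also want to check where u'\xe9' came from in the first place; it may not be a character you actually want.)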
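One way to act on this is to decode only values that are still byte strings and pass Unicode values through untouched. The snippet below is a minimal sketch of that idea, not the asker's actual code: to_text and find_after are hypothetical helper names, and it assumes the bytes in the file really are UTF-8.

```python
# -*- coding: utf-8 -*-
# Sketch only: to_text and find_after are hypothetical helpers, not part of
# the original script. Assumes any byte strings are UTF-8 encoded.


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding only if it is a byte string."""
    # In Python 2.7, bytes is an alias for str, so this check also works there.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already Unicode: decoding again is what triggers the implicit ASCII encode.
    return value


def find_after(doc_content, word_value, offset_end):
    """Find word_value in doc_content, searching after character offset_end."""
    return to_text(doc_content).find(to_text(word_value), int(offset_end) + 1)


doc = u"caf\xe9 au lait, caf\xe9 noir"
word = u"caf\xe9"

# Works for Unicode input and for UTF-8 byte strings alike.
index_from_text = find_after(doc, word, "3")
index_from_bytes = find_after(doc.encode("utf-8"), word.encode("utf-8"), "3")
```

Both calls return a character index into the decoded text, so offsets such as characterOffsetEnd_value need to be character offsets as well, not byte offsets.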

codecs.utf_8_decode(input.encode('utf8'))