Getting a Unicode error when parsing XML files in Python

python xml

I have a directory of XML files, where each file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Brand</word>
            <lemma>brand</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>5</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
          </token>
          <token id="2">
            <word>Blogs</word>
            <lemma>blog</lemma>
            <CharacterOffsetBegin>6</CharacterOffsetBegin>
            <CharacterOffsetEnd>11</CharacterOffsetEnd>
            <POS>NNS</POS>
            <NER>O</NER>
          </token>
          <token id="3">
            <word>Capture</word>
            <lemma>capture</lemma>
            <CharacterOffsetBegin>12</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>VBP</POS>
            <NER>O</NER>
          </token>

However, I get this error:

  File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)

What does this mean, and how do I fix it?

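Don't use BeautifulSoup; it's completely deprecated. Why not the standard library? If you want more sophisticated XML processing you can use lxml (but I'm pretty sure you don't need it).

That will easily solve your problem.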
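For reference, here is a minimal sketch of the standard-library route, assuming the files follow the CoreNLP layout shown in the question. The SAMPLE document and the words_from_xml helper are made up for illustration; only xml.etree.ElementTree is the real API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as the question's files,
# with one non-ASCII word added to show that nothing breaks.
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<root><document><sentences><sentence id="1"><tokens>'
    u'<token id="1"><word>Brand</word><lemma>brand</lemma></token>'
    u'<token id="2"><word>na\xefve</word><lemma>na\xefve</lemma></token>'
    u'</tokens></sentence></sentences></document></root>'
).encode('utf-8')


def words_from_xml(xml_bytes):
    """Return the text of every <word> element (illustrative helper)."""
    root = ET.fromstring(xml_bytes)
    # No str() call here: the parser already returns decoded text.
    return [el.text.strip() for el in root.iter('word') if el.text]


words = words_from_xml(SAMPLE)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring.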

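Edit: forget my previous answer, it was bad -_- Your problem is the str(my_string) call in Python 2. Calling str() on a unicode string in Python 2 amounts to trying to encode it as ASCII, which fails if my_string contains non-ASCII characters. Use the encode('utf-8') method instead.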
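A small sketch of what goes wrong, assuming a made-up word with u'\xef' at position 2 to mirror the traceback. It spells out the ASCII encode that Python 2's str() performs implicitly, so the same lines behave identically on Python 2 and 3.

```python
text = u'na\xefve'  # hypothetical word; u'\xef' sits at position 2

# The implicit step behind str(text) in Python 2, written explicitly:
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    # Same error class as in the question's traceback.
    ascii_failed = True

# Naming the codec explicitly works for any character.
utf8_bytes = text.encode('utf-8')
```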

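The str() function encodes with the ascii codec, and because word.string.strip() returns non-ASCII characters somewhere in your XML files, you get this error. The solution is to use: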

file_list.append(word.string.strip().encode('utf-8'))

To get the original value back, you need to do the following:

for item in file_list:
    print item.decode('utf-8')

Hope that helps.

In this line of code:

file_list.append(str(word.string.strip()))

why are you using str? The data is Unicode, and you can append Unicode strings to a list as they are. If you need a bytestring, use word.string.strip().encode('utf8') instead.
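One way to read that advice, sketched under assumptions: keep the list as Unicode text and encode only where the data leaves the program. The word list and the temporary words.txt file are hypothetical; io.open with an encoding argument is available on both Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

file_list = [u'Brand', u'na\xefve']  # hypothetical words, kept as text

# Encode once, at the output boundary, instead of per item.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(u'\n'.join(file_list))

# Reading with the same encoding gives the text back unchanged.
with io.open(path, 'r', encoding='utf-8') as src:
    read_back = src.read().split(u'\n')
```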

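Can you tell me how to use lxml?

What do you mean by BeautifulSoup being "completely deprecated"? It hasn't had a release in a while, but that's not the same thing. There are good reasons to use something like lxml for XML, but I'm not sure deprecation is one of them.

But again, I'd suggest the standard library; if your XML is well-formed, that's the simplest solution there.

@Chris: you're right, that was more of an opinion-based statement, and probably an extreme one too.

@LudovicViaud: I don't quite follow. In my code, str(word.string.strip()), if I remove the str then I get output like [u'learning', u'charged', u'h.I.v', u'maintenance', u'unspecial'…]. This is in Unicode format. Isn't there any way to make this work and get the plain words instead of Unicode?

I'm wondering why decode is used? Only encode is mentioned in the text. :o

decode "decodes obj using the codec registered for encoding". See here.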
