Python: How do I make lxml's iterparse ignore invalid XML characters?


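I have an XML file that contains invalid characters. lxml's XMLParser raises an exception on these characters, but when I create the XMLParser with the recover=True option, it ignores the bad characters and works fine.

My question is: how do I set a similar flag for lxml's iterparse function?

To reproduce:

The broken XML (/tmp/z.xml):


Bad characters:
Note: after the "Bad characters:" string there are two ASCII characters #31 (0x1F), which I could not copy and paste here.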
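
Since the control characters cannot be pasted, one way to recreate the file is to write it from Python. This is only a sketch: the element layout is inferred from the recovered output shown later in the question, the XML declaration line is an assumption (chosen so that the bad characters land on line 4, column 21, as the error message reports), and `write_broken_xml` is a hypothetical helper name.

```python
import os
import tempfile

# Layout inferred from the recovered output in the question; the XML
# declaration is an assumption. Two 0x1F bytes follow "Bad characters:".
BROKEN_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<items>\n"
    b"\t<item>\n"
    b"\t\t<B>Bad characters:\x1f\x1f</B>\n"
    b"\t</item>\n"
    b"</items>\n"
)


def write_broken_xml(path):
    """Write the sample document, including the raw control bytes, to path."""
    with open(path, "wb") as fh:
        fh.write(BROKEN_XML)
    return path


if __name__ == "__main__":
    # Mirrors the /tmp/z.xml path used in the question on most Unix systems.
    target = os.path.join(tempfile.gettempdir(), "z.xml")
    print(write_broken_xml(target))
```

Writing in binary mode keeps the two 0x1F bytes intact regardless of the platform's default text encoding.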

Parsing error with XMLParser:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()
tree   = etree.parse(fd, parser)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
  File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
  File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
  File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
  File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21

To ignore the bad characters, I set recover=True, and it works fine:

import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree   = etree.parse(fd, parser)
etree.tostring(tree)

# OUTPUT:
'<items>\n\t<item>\n\t\t<B>Bad characters:</B>\n\t</item>\n</items>'

When I use iterparse, I hit the same error again. How can I make it ignore the bad characters?

fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it: print(e)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
  File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
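
Newer lxml releases document a `recover` keyword on `iterparse` itself, mirroring the `XMLParser` option. The following is a minimal sketch, assuming the installed lxml is recent enough to accept that keyword (an older version, such as the one that produced the traceback above, may reject it). `iter_events_recovering` is a hypothetical helper name, not part of lxml.

```python
def iter_events_recovering(path, events=("start", "end")):
    """Yield (event, tag) pairs from path, asking the parser to skip bad input."""
    # Imported here so the helper fails only when it is actually used
    # on a system without lxml.
    from lxml import etree

    with open(path, "rb") as fh:
        # Assumption: the installed lxml supports recover= on iterparse.
        for event, elem in etree.iterparse(fh, events=events, recover=True):
            yield event, elem.tag
```

Called as `for event, tag in iter_events_recovering('/tmp/z.xml'): print(event, tag)`, this should walk the whole document instead of raising XMLSyntaxError, though recovery is best-effort and the parser decides what gets dropped.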

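Read the last post. It looks like this is not possible with iterparse.

This looks like a duplicate, but that question has no accepted or upvoted answer. I suspect the answer is what @pypat said.

@mzjn: Right, it seems to be the same question.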
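
If iterparse cannot be told to recover, one workaround in the spirit of the comments above is to strip the offending characters before the parser sees them. This is a sketch under stated assumptions: `CleanedStream` and `_INVALID_XML_BYTES` are hypothetical names, only the ASCII control bytes that XML 1.0 forbids are removed (safe for ASCII-compatible encodings such as UTF-8, not for UTF-16), and the standard library's `xml.etree.ElementTree.iterparse` stands in for lxml's so the example runs without third-party packages. lxml's `iterparse` also accepts a file-like object with a `read` method, so the wrapper should carry over.

```python
import io
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_XML_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


class CleanedStream:
    """File-like wrapper (hypothetical helper) that drops invalid control bytes."""

    def __init__(self, raw):
        self._raw = raw  # any binary file object

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return b""  # real end of input
            cleaned = chunk.translate(None, _INVALID_XML_BYTES)
            if cleaned:
                return cleaned
            # The chunk held only invalid bytes; read more instead of
            # signalling a premature end of file.


if __name__ == "__main__":
    broken = io.BytesIO(b"<items><item><B>Bad characters:\x1f\x1f</B></item></items>")
    for event, elem in ET.iterparse(CleanedStream(broken), events=("start", "end")):
        print(event, elem.tag)
```

Filtering at the byte level keeps the parse streaming, so large files do not need to be loaded into memory. The trade-off is that the bad characters are silently discarded rather than reported.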