Python: parse an XML file block by block and get the values in each block
I have a 10 GB XML file that consists of a list of different blocks. Here is a snippet of my file:
<image>
    <ref>www.test.com</ref>
    <label/>
    <number>0</number>
    <ID>ID0</ID>
    <name>test1</name>
    <comment>
        <line number="0">This is a comment</line>
        <line number="1">This is also another comment</line>
    </comment>
    <creationDate>2017-02-13T15:46:16-04:00</creationDate>
</image>
<result>
    <ref>www.test1.com</ref>
    <label/>
    <number>001</number>
    <ID>RE1</ID>
    <name>test2</name>
    <comment>
        <line number="0">This is a comment2</line>
    </comment>
    <creationDate>2017-01-13T15:46:16-04:00</creationDate>
</result>
<image>
    <ref>www.test3.com</ref>
    <label/>
    <number>1</number>
    <ID>ID1</ID>
    <value>10030</value>
    <name>test3</name>
    <comment>
        <line number="0">This is a comment3</line>
    </comment>
    <creationDate>2017-04-13T15:46:16-04:00</creationDate>
</image>
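Can you help me with this? I know nothing about XML parsing.
I would also like to convert each parsed block into a Python dictionary. Is that possible?

lxml's etree.iterparse does not read the XML file "line by line". By default, it returns an "end" event at the end of each element. That is, if your input file looks like this: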
<data>
    <widgets location="earth">
        <widget name="gizmo"/>
        <widget name="gadget"/>
        <widget name="thingamajig"/>
    </widgets>
</data>
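The default behaviour can be sketched as follows. This is only a minimal, self-contained illustration: it uses the standard library's `xml.etree.ElementTree.iterparse` (which takes the same `events` argument as lxml's) so that it runs without lxml installed, it reads an in-memory copy of the sample above instead of a file on disk, and the helper name `list_events` is made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# In-memory copy of the sample file shown above
SAMPLE = b"""<data>
  <widgets location="earth">
    <widget name="gizmo"/>
    <widget name="gadget"/>
    <widget name="thingamajig"/>
  </widgets>
</data>"""


def list_events(xml_bytes, events=None):
    """Return (event, tag) pairs in the order iterparse reports them."""
    source = io.BytesIO(xml_bytes)
    # events=None is the default: only 'end' events are reported
    return [(event, element.tag)
            for event, element in ET.iterparse(source, events=events)]


for event, tag in list_events(SAMPLE):
    print(event, tag)
```

Each element is reported once, when its closing tag has been read, so children always appear before the parent that contains them.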
With the default events, looping over etree.iterparse(fd) and printing each event, element pair gives:

end <Element widget at 0x7f31e3132488>
end <Element widget at 0x7f31e3123f38>
end <Element widget at 0x7f31e3123ef0>
end <Element widgets at 0x7f31e31327a0>
end <Element data at 0x7f31e31324d0>

If needed, you can also request an event at the start of each element, like this:

for event, element in etree.iterparse(fd, events=('start', 'end')):
    print(event, element)

Its output is:

start <Element data at 0x7fccf78cc518>
start <Element widgets at 0x7fccf78cc7e8>
start <Element widget at 0x7fccf78cc4d0>
end <Element widget at 0x7fccf78cc4d0>
start <Element widget at 0x7fccf78bdf80>
end <Element widget at 0x7fccf78bdf80>
start <Element widget at 0x7fccf78bdf38>
end <Element widget at 0x7fccf78bdf38>
end <Element widgets at 0x7fccf78cc7e8>
end <Element data at 0x7fccf78cc518>
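The same events can be used to collect values into a dictionary, here a list of widget names keyed by location:

from lxml import etree

# Open in binary mode so the parser handles the encoding itself
with open('data2.xml', 'rb') as fd:
    widgets = {}
    loc = None
    for event, element in etree.iterparse(fd, events=('start', 'end')):
        if event == 'start' and element.tag == 'widgets':
            # Attributes are already available at the 'start' event
            loc = element.get('location')
            widgets[loc] = []
        elif event == 'end' and element.tag == 'widget':
            widgets[loc].append(element.get('name'))

print(widgets)

Its output is:

{'earth': ['gizmo', 'gadget', 'thingamajig']}

I hope this gives you some idea of how to process each block of interest in your input file.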
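To connect this back to the question, here is a rough sketch of turning each top-level block into a dictionary while keeping memory use low. It rests on several assumptions that are not confirmed by the question: the blocks are direct children of a single root element (called `<root>` here; the snippet does not show the real one), nested children such as `<comment>` are only one level deep, and the standard library's `xml.etree.ElementTree.iterparse` is used in place of lxml so the example runs as-is. The names `block_to_dict` and `iter_blocks` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the question's snippet, wrapped in an assumed <root> element
SAMPLE = b"""<root>
<image><ref>www.test.com</ref><label/><number>0</number><ID>ID0</ID>
<name>test1</name><comment><line number="0">This is a comment</line><line number="1">This is also another comment</line></comment>
<creationDate>2017-02-13T15:46:16-04:00</creationDate></image>
<result><ref>www.test1.com</ref><label/><number>001</number><ID>RE1</ID>
<name>test2</name><comment><line number="0">This is a comment2</line></comment>
<creationDate>2017-01-13T15:46:16-04:00</creationDate></result>
</root>"""


def block_to_dict(block):
    """Flatten one block; children with sub-elements become a list of texts."""
    result = {"type": block.tag}
    for child in block:
        if len(child):  # has sub-elements, e.g. <comment> with <line> children
            result[child.tag] = [sub.text for sub in child]
        else:
            result[child.tag] = child.text  # None for empty elements like <label/>
    return result


def iter_blocks(source):
    """Yield one dict per direct child of the root element."""
    depth = 0
    root = None
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = element  # the first 'start' event is the root element
        else:
            depth -= 1
            if depth == 1:  # a direct child of the root has just been closed
                yield block_to_dict(element)
                root.clear()  # drop processed blocks so they can be freed


for block in iter_blocks(io.BytesIO(SAMPLE)):
    print(block)
```

For a real 10 GB file, the `io.BytesIO(SAMPLE)` argument would be replaced by a file opened in binary mode. Clearing the root after each block is what keeps the whole tree from accumulating in memory; without it, iterparse still builds the full document.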