Python: parse XML and store a value if it appears in a list


My XML file below totals 3 GB, which is why I rely on iterparse:

<events version="1.0">
    <event time="13834.0" type="actend" person="1537047" link="335909" facility="home811408" actType="home"  />
    <event time="13834.0" type="departure" person="1537047" link="335909" legMode="car_passenger"  />
    <event time="14516.0" type="travelled" person="1537047" distance="9749.86232009391"  />
    <event time="14516.0" type="arrival" person="1537047" link="79554" legMode="car_passenger"  />
    <event time="14516.0" type="actstart" person="1537047" link="79554" facility="105155" actType="work"  />
    <event time="15380.0" type="actend" person="3716370" link="280959" facility="outside_484" actType="outside"  />
    <event time="15380.0" type="departure" person="3716370" link="280959" legMode="car"  />
    <event time="15380.0" type="PersonEntersVehicle" person="3716370" vehicle="3716370"  />
    <event time="15380.0" type="vehicle enters traffic" person="3716370" link="280959" vehicle="3716370" networkMode="car" relativePosition="1.0"  />
    <event time="15380.0" type="coldEmissionEvent" linkId="280959" vehicleId="3716370" NO2="0.00273337378166616" NOx="0.33" HC="3.78" CO="19.99" FC="23.79" PM="0.00789998099207878" NMHC="3.57"  />
    <event time="15381.0" type="left link" vehicle="3716370" link="280959"  />
    <event time="15381.0" type="entered link" vehicle="3716370" link="103801"  />
    <event time="15386.0" type="left link" vehicle="3716370" link="103801"  />
    <event time="15386.0" type="entered link" vehicle="3716370" link="502211"  />
    <event time="15386.0" type="warmEmissionEvent" linkId="103801" vehicleId="3716370" NO2="0.0016834393054024187" CO2_TOTAL="5.211468969715323" NOx="0.010865835516688339" SO2="2.6488925864494008E-5" HC="0.0029077588002405412" CO="0.02157863109652191" FC="1.6554329969579966" PM="4.59119810564296E-4" NMHC="0.002754718863385776"  />
    <event time="15391.0" type="left link" vehicle="3716370" link="502211"  />
</events>
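What I want is a table that shows me each person who has been registered on any of the closed links in the XML; the closed links are held in a list called closed_links. In the final table, each value of person should be unique. Having the link in the output is not mandatory; I only want it as a quality check to see whether the code works.

So far my code produces no result, mainly because I don't know how to write the condition that a value matches any one of the values in the list: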

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

tree = ET.iterparse(gzip.open('V0_1pm/output_events.xml.gz', 'r'))
agents_o_i = defaultdict(list)
for xml_event, elem in tree:
    attributes = elem.attrib
    if elem.tag == 'event' and elem.attrib["link"] in closed_links:
         agents_o_i[attributes['person']].append(attributes['link'])

agents_o_i = pd.DataFrame.from_dict(agents_o_i, orient='index')
agents_o_i.to_csv("out/V1_10pct/traveltimes_V1.csv", sep=';')
Expected output:

person  link   
3716370 280959 
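Thank you very much for your help.

Your if block is crashing because of a missing key: not every event element has a link (or person) attribute, so elem.attrib["link"] raises a KeyError.

Make sure you first check that the keys are present in the attributes: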

for xml_event, elem in tree:
    if elem.tag == 'event' \
    and 'person' in elem.attrib \
    and 'link' in elem.attrib \
    and elem.attrib['link'] in closed_links:
        agents_o_i[elem.attrib['person']].append(elem.attrib['link'])
The result so far is:

>>> print(agents_o_i)
defaultdict(list, {'3716370': ['280959', '280959', '280959']})
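The output above repeats the same link three times, while the question asks for unique values. A minimal sketch of one way to get there, assuming it is acceptable to use elem.get() (which returns None for a missing attribute) in place of the explicit key checks and to collect links in a set. The SAMPLE data, the contents of closed_links, and the function name persons_on_closed_links are made up for illustration and are not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data and closed-link list, for illustration only.
SAMPLE = b"""<events version="1.0">
    <event time="15380.0" type="departure" person="3716370" link="280959" legMode="car" />
    <event time="15380.0" type="vehicle enters traffic" person="3716370" link="280959" vehicle="3716370" />
    <event time="15381.0" type="left link" vehicle="3716370" link="280959" />
    <event time="14516.0" type="arrival" person="1537047" link="79554" legMode="car_passenger" />
</events>"""
closed_links = {"280959", "103801"}


def persons_on_closed_links(source, closed):
    """Map each person to the set of closed links they were seen on."""
    found = {}
    for _, elem in ET.iterparse(source):
        person = elem.get("person")  # None if the attribute is absent
        link = elem.get("link")      # None if the attribute is absent
        if elem.tag == "event" and person is not None and link in closed:
            # A set keeps each link only once per person
            found.setdefault(person, set()).add(link)
    return found


result = persons_on_closed_links(io.BytesIO(SAMPLE), closed_links)
print(result)  # {'3716370': {'280959'}}
```

Storing closed_links as a set rather than a list also makes each membership test constant time on average, which can matter over a 3 GB file.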
Alternatively, you can parse the file manually, line by line, in much the same way:

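import gzip
from collections import defaultdict

agents_o_i = defaultdict(list)
# Open in text mode ('rt') so each line is a str; with 'rb' the lines are
# bytes and the str membership tests below raise a TypeError in Python 3
with gzip.open('output_events.xml.gz', 'rt') as f:
    for line in f:
        # Match the attribute syntax so that, e.g., type="left link" is not mistaken for a link attribute
        if 'person="' in line and 'link="' in line:
            link = line.split('link="')[1].split('"')[0]
            if link in closed_links:
                person = line.split('person="')[1].split('"')[0]
                agents_o_i[person].append(link)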
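Whichever parsing approach is used, the collected mapping still has to become the one-row-per-person table the question asks for. The following is only a sketch: it uses the standard-library csv module instead of pandas, assumes a semicolon separator like the one in the question's to_csv call, and joins multiple links with a comma. The helper name write_person_table and the sample mapping are hypothetical.

```python
import csv
import io

# Hypothetical mapping, shaped like the result printed in the answer.
agents_o_i = {"3716370": ["280959", "280959", "280959"]}


def write_person_table(mapping, out):
    """Write one row per person with that person's unique links."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["person", "link"])
    for person in sorted(mapping):
        links = sorted(set(mapping[person]))  # drop repeated links
        writer.writerow([person, ",".join(links)])


buffer = io.StringIO()
write_person_table(agents_o_i, buffer)
table_text = buffer.getvalue()
print(table_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.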

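Your if block is buggy. Could you post the expected output for this example?

@alec_djinn Thanks for your input, please see the bottom of the updated question.

Link 103801 does not appear in any line that contains person. How could you have it in your output?

You are right, my mistake. Unfortunately, when I try to run it on my large file, I run out of memory. I inserted elem.clear() below the last line of the solution, at the same indentation. Do you have any suggestions on how to make it more memory efficient?

You can read the file line by line and parse it with a custom function. That avoids loading all the data into memory.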
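On the memory question: one likely reason elem.clear() alone may not be enough is that the root element keeps a reference to every child it has seen, even after each child has been emptied. A commonly used pattern with iterparse is to grab the root from the first start event and clear it after each processed event. The sketch below assumes that pattern suits this file; the function name stream_persons and the tiny in-memory gzip sample are hypothetical stand-ins for the real 3 GB file.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical gzip-compressed sample standing in for output_events.xml.gz.
SAMPLE_GZ = gzip.compress(b"""<events version="1.0">
    <event time="15380.0" type="departure" person="3716370" link="280959" />
    <event time="15381.0" type="left link" vehicle="3716370" link="280959" />
    <event time="14516.0" type="arrival" person="1537047" link="79554" />
</events>""")


def stream_persons(fileobj, closed):
    """Collect person -> set of closed links while keeping the tree small."""
    found = {}
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "event":
            person = elem.get("person")
            link = elem.get("link")
            if person is not None and link in closed:
                found.setdefault(person, set()).add(link)
            # Clearing the root drops its references to processed children
            root.clear()
    return found


with gzip.open(io.BytesIO(SAMPLE_GZ), "rb") as f:
    streamed = stream_persons(f, {"280959"})
print(streamed)  # {'3716370': {'280959'}}
```

For the real file, the same function could be called with gzip.open('output_events.xml.gz', 'rb') as the file object.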