Code runs slowly when parsing large XML files with Python

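I have two very large XML files that hold different data for the same place/building/room combinations. Currently I parse the first large file with Python's etree parse, loop through it to extract the place/building/room IDs (plus other information), and then use those IDs to search the second large XML file, which has the same structure as the first. For the second file I now use lxml iterparse to find and extract the Place element that corresponds to a given place from the first file, then loop through that Place element to find the relevant data. This works, but it keeps getting slower the further I get into the first file.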

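I have done my best to clear the irrelevant elements in the iterparse of the second large file, which helped, but I have 5,000 places to loop through. The first 100 places are processed very quickly (under a minute), the next 400 take 30 minutes, and so on. After 15 hours I had reached roughly 4,000 places and progress was very slow. I suspect that one of the parses is holding on to too much data.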
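The timings above look roughly like what one would expect if each lookup restarts at the top of the second file. As a rough model only, assuming (the question does not state this) that places appear in the same order in both files and that finding the k-th place costs about k units of work, the arithmetic can be sketched as follows. `rescan_cost` is a hypothetical helper used for illustration.

```python
# Back-of-the-envelope model, not a measurement: assume looking up the k-th
# place costs k units because the scan restarts at the top of the second file.
def rescan_cost(first, last):
    """Total units of work to look up places first..last (1-based, inclusive)."""
    return sum(range(first, last + 1))


first_100 = rescan_cost(1, 100)    # places 1..100
next_400 = rescan_cost(101, 500)   # places 101..500
all_5000 = rescan_cost(1, 5000)    # the whole run

# Ratios relative to the first batch of 100 places
print(next_400 / first_100)
print(all_5000 / first_100)
```

Under this model the next 400 places cost about 24 times the first 100, which is in the same range as "under a minute" versus "30 minutes". That would point at repeated re-scanning rather than memory alone, although only profiling can confirm it.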

Below is simplified code using generalized XML (sorry, I couldn't simplify it any further).

import csv
from lxml import etree as ET

largefile1 = "largefile1.xml"
largefile2 = "largeFile2.xml"

ptree = ET.parse(largefile1)
proot = ptree.getroot()

o = open('output.txt', 'w', newline='')

def get_place_elem(pplaceid, largefile2):
    # Scan the second file from the top and return the matching Place element
    Placenode = ET.iterparse(largefile2, events=("end",), tag='Place')

    for event, Place in Placenode:
        for PlaceId in Place.findall('PlaceIdentification'):
            placeid = PlaceId.find('PlaceIdentifier').text
            if placeid == pplaceid:
                del Placenode
                return Place
        # Not a match: free this Place and its already-processed siblings
        Place.clear()
        while Place.getprevious() is not None:
            del Place.getparent()[0]
    del Placenode

def getfacdata(pplaceid, pbuildid, proomid, Place):
    # Find the building/room pair inside the Place taken from the second file
    for Build in Place.findall('Building'):
        bid = ' '
        for BuildId in Build.findall('BuildingIdentification'):
            bid = BuildId.find('Identifier').text
        if bid == pbuildid:
            for Room in Build.findall('Room'):
                roomid = ' '
                for RoomId in Room.findall('RoomIdentification'):
                    roomid = RoomId.find('Identifier').text
                    if roomid == proomid:
                        # ... collect data from Room element ...
                        # ... do some simple math with if statements
                        return data  # list of 15 data values (built in the elided code)

prevpplaceid = None
placecnt = 0
Place = None

for pPlace in proot.findall('.//Place'):
    for pPlaceId in pPlace.findall('PlaceIdentification'):
        pplaceid = pPlaceId.find('PlaceIdentifier').text
        placecnt += 1
        # ... get some data

    for pBuild in pPlace.findall('Building'):
        for pBuildId in pBuild.findall('BuildingIdentification'):
            pbid = pBuildId.find('Identifier').text

        for pRoom in pBuild.findall('Room'):
            for pRoomId in pRoom.findall('RoomIdentification'):
                proomid = pRoomId.find('Identifier').text

                # New place: drop the previous Place and re-scan the second file
                if prevpplaceid != pplaceid:
                    if placecnt != 1:
                        Place.clear()
                    Place = get_place_elem(pplaceid, largefile2)
                    prevpplaceid = pplaceid

                data = getfacdata(pplaceid, pbid, proomid, Place)

                # ... collect data from Room element ...
                # ... do some simple math with if statements
                writer = csv.writer(o)
                writer.writerow((
                    # data from pRoom and from the 'data' list built from largefile2
                ))
                break
    prevpplaceid = pplaceid

o.close()
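One pattern in the code above is that `get_place_elem` opens the second file again and scans from the first Place for every new place ID, so later places may cost more to find. A possible alternative, sketched here under assumptions, is to stream the second file once and index it by (place, building, room). The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without lxml. `build_room_index`, `SAMPLE`, and the `Capacity` element are hypothetical names standing in for the elided room data, and the sketch assumes the collected values fit in memory.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the second file; Capacity is a hypothetical data element.
SAMPLE = b"""<Payload>
<Place>
  <PlaceIdentification><PlaceIdentifier>id001</PlaceIdentifier></PlaceIdentification>
  <Building>
    <BuildingIdentification><Identifier>Building1</Identifier></BuildingIdentification>
    <Room>
      <RoomIdentification><Identifier>Room1</Identifier></RoomIdentification>
      <Capacity>4</Capacity>
    </Room>
    <Room>
      <RoomIdentification><Identifier>Room2</Identifier></RoomIdentification>
      <Capacity>8</Capacity>
    </Room>
  </Building>
</Place>
</Payload>"""


def build_room_index(source):
    """Stream source once; map (place_id, building_id, room_id) to leaf values."""
    index = {}
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event != "end" or elem.tag != "Place":
            continue
        place_id = elem.findtext("PlaceIdentification/PlaceIdentifier")
        for build in elem.findall("Building"):
            build_id = build.findtext("BuildingIdentification/Identifier")
            for room in build.findall("Room"):
                room_id = room.findtext("RoomIdentification/Identifier")
                # Store plain strings only, so no element keeps the tree alive.
                index[(place_id, build_id, room_id)] = {
                    child.tag: child.text for child in room if len(child) == 0
                }
        root.clear()  # drop the Place elements processed so far
    return index


index = build_room_index(io.BytesIO(SAMPLE))
print(index[("id001", "Building1", "Room2")])
```

With an index like this, the loop over the first file becomes one dictionary lookup per room instead of a fresh scan. If the second file is too large to index fully, the same pass could keep only the keys collected from the first file beforehand.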

Generalized XML:

<Payload>
<Place>
    <PlaceName>Place1</PlaceName>
    <PlaceStatusCode>OP</PlaceStatusCode>
    <PlaceStatusCodeYear>2011</PlaceStatusCodeYear>
    <PlaceComment/>
    <PlaceIdentification>
        <PlaceIdentifier>id001</PlaceIdentifier>
        <StateAndCountyFIPSCode>77702</StateAndCountyFIPSCode>
    </PlaceIdentification>
    <PlaceAddress>
        <LocationAddressText>111 Main</LocationAddressText>
        <SupplementalLocationText/>
        <LocalityName>City1</LocalityName>
        <LocationAddressStateCode>State1</LocationAddressStateCode>
        <LocationAddressPostalCode>12345</LocationAddressPostalCode>
        <LocationAddressCountryCode>USA</LocationAddressCountryCode>
    </PlaceAddress>
    <PlaceGeographicCoordinates>
        <LatitudeMeasure>88.888</LatitudeMeasure>
        <LongitudeMeasure>-99.999</LongitudeMeasure>
    </PlaceGeographicCoordinates>
    <Building>
        <BuildingDescription>Building1</BuildingDescription>
        <BuildingTypeCode>999</BuildingTypeCode>
        <BuildingIdentification>
            <Identifier>Building1</Identifier>
        </BuildingIdentification>
        <Room>
            <RoomIdentification>
                <Identifier>Room1</Identifier>
            </RoomIdentification>
            ... More data ...
        </Room>
        <Room>
            <RoomIdentification>
                <Identifier>Room2</Identifier>
            </RoomIdentification>
            ... More data ...
        </Room>
        ...
    </Building>
    <Building>
        <BuildingDescription>Building2</BuildingDescription>
        <BuildingTypeCode>999</BuildingTypeCode>
        <BuildingIdentification>
            <Identifier>Building2</Identifier>
        </BuildingIdentification>
        <Room>
            <RoomIdentification>
                <Identifier>Room1</Identifier>
            </RoomIdentification>
            ... More data ...
        </Room>
        <Room>
            <RoomIdentification>
                <Identifier>Room4</Identifier>
            </RoomIdentification>
            ... More data ...
        </Room>
        ...
    </Building>
    ...
</Place>
<Place>
    ...
</Place>
</Payload>


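Could you indent the code, please? Use the Python profiling tools to locate the hot spots and where the memory is being consumed. That should give you enough results to get started.

I've done my best to re-indent the code (tabs in Python code are really not a good idea), but there still seem to be some errors. Can you verify and correct them?

I edited the indentation and made some other edits that I had overlooked when trying to clean up (and generalize) the code for posting. The code actually works; I'm just looking for advice on using memory efficiently when parsing with iterparse or a regular parse. I'll try the suggested profiling and report back with my findings. Thanks.

It looks like I've lost all the indentation. No time to fix it right now. Later this afternoon.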
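The profiling suggestion in the comments can be tried with the standard library alone. The following is a minimal sketch, assuming a stand-in function `work` (hypothetical) in place of the real parsing loop: `cProfile` reports where the time goes and `tracemalloc` reports Python-level allocations.

```python
import cProfile
import io
import pstats
import tracemalloc


def work():
    """Hypothetical stand-in for the real parsing loop."""
    return [str(i) * 10 for i in range(50_000)]


# CPU hot spots: profile one call and show the five most expensive entries.
profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())

# Memory: record allocations made while the workload runs.
tracemalloc.start()
result = work()
current, peak = tracemalloc.get_traced_memory()
top = tracemalloc.take_snapshot().statistics("lineno")[:3]
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")
for stat in top:
    print(stat)
```

One caveat, as far as I know: `tracemalloc` tracks memory allocated through Python's allocator, so trees held inside a C library such as the libxml2 used by lxml may not show up fully. Watching the process's resident memory alongside these numbers is a reasonable cross-check.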