Python: parsing an XML file with many children and grandchildren

I am very new to Python. I have a very large XML file from which I want to extract some data. Here is an excerpt:

<program>
    <id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id>
    <programID>3853</programID>
    <orchestra>New York Philharmonic</orchestra>
    <season>1842-43</season>
    <concertInfo>
        <eventType>Subscription Season</eventType>
        <Location>Manhattan, NY</Location>
        <Venue>Apollo Rooms</Venue>
        <Date>1842-12-07T05:00:00Z</Date>
        <Time>8:00PM</Time>
    </concertInfo>
    <worksInfo>
        <work ID="52446*">
            <composerName>Beethoven,  Ludwig  van</composerName>
            <workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
            <conductorName>Hill, Ureli Corelli</conductorName>
        </work>
        <work ID="8834*4">
            <composerName>Weber,  Carl  Maria Von</composerName>
            <workTitle>OBERON</workTitle>
            <movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
            <conductorName>Timm, Henry C.</conductorName>
            <soloists>
                <soloist>
                    <soloistName>Otto, Antoinette</soloistName>
                    <soloistInstrument>Soprano</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
            </soloists>
        </work>
        <work ID="3642*">
            <composerName>Hummel,  Johann</composerName>
            <workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle>
            <soloists>
                <soloist>
                    <soloistName>Scharfenberg, William</soloistName>
                    <soloistInstrument>Piano</soloistInstrument>
                    <soloistRoles>A</soloistRoles>
                </soloist>
                <soloist>
                    <soloistName>Hill, Ureli Corelli</soloistName>
                    <soloistInstrument>Violin</soloistInstrument>
                    <soloistRoles>A</soloistRoles>
                </soloist>
                <soloist>
                    <soloistName>Derwort, G. H.</soloistName>
                    <soloistInstrument>Viola</soloistInstrument>
                    <soloistRoles>A</soloistRoles>
                </soloist>
                <soloist>
                    <soloistName>Boucher, Alfred</soloistName>
                    <soloistInstrument>Cello</soloistInstrument>
                    <soloistRoles>A</soloistRoles>
                </soloist>
                <soloist>
                    <soloistName>Rosier, F. W.</soloistName>
                    <soloistInstrument>Contrabass</soloistInstrument>
                    <soloistRoles>A</soloistRoles>
                </soloist>
            </soloists>
        </work>
        <work ID="0*">
            <interval>Intermission</interval>
        </work>
        <work ID="8834*3">
            <composerName>Weber,  Carl  Maria Von</composerName>
            <workTitle>OBERON</workTitle>
            <movement>Overture</movement>
            <conductorName>Etienne, Denis G.</conductorName>
        </work>
        <work ID="8835*1">
            <composerName>Rossini,  Gioachino</composerName>
            <workTitle>ARMIDA</workTitle>
            <movement>Duet</movement>
            <conductorName>Timm, Henry C.</conductorName>
            <soloists>
                <soloist>
                    <soloistName>Otto, Antoinette</soloistName>
                    <soloistInstrument>Soprano</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
                <soloist>
                    <soloistName>Horn, Charles Edward</soloistName>
                    <soloistInstrument>Tenor</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
            </soloists>
        </work>
        <work ID="8837*6">
            <composerName>Beethoven,  Ludwig  van</composerName>
            <workTitle>FIDELIO, OP. 72</workTitle>
            <movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement>
            <conductorName>Timm, Henry C.</conductorName>
            <soloists>
                <soloist>
                    <soloistName>Horn, Charles Edward</soloistName>
                    <soloistInstrument>Tenor</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
            </soloists>
        </work>
        <work ID="8336*4">
            <composerName>Mozart,  Wolfgang  Amadeus</composerName>
            <workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle>
            <movement>"Ach Ich liebte," Konstanze (aria)</movement>
            <conductorName>Timm, Henry C.</conductorName>
            <soloists>
                <soloist>
                    <soloistName>Otto, Antoinette</soloistName>
                    <soloistInstrument>Soprano</soloistInstrument>
                    <soloistRoles>S</soloistRoles>
                </soloist>
            </soloists>
        </work>
        <work ID="5543*">
            <composerName>Kalliwoda,  Johann  W.</composerName>
            <workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle>
            <conductorName>Timm, Henry C.</conductorName>
        </work>
    </worksInfo>
</program>
<program>
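When I run my code, I only get the last soloistName and soloistInstrument. The result I have in mind is something like a repeated observation for each program. So I would have something like this:

13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S

13918, ..., 3642*, Scharfenberg, William, Piano, A

13918, ..., 3642*, Hill, Ureli Corelli, Violin, A

And so on until the last work ID:

13918, ..., 8336*4, Otto, Antoinette, Soprano, S

What I get is only the last work:

13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S

The file contains more than 15,000 programs like the sample I posted. I want to parse all of them and extract the information mentioned above. I'm not entirely sure how to go about this. I have searched the internet for a way to do it, but nothing I have tried has worked.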

Your problem is that you misunderstand how loops work. Specifically, the values only change inside the loop:

for x in range(10):
    pass

print(x) # prints 9
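vs. printing x inside the loop body, which shows every value from 0 to 9 in turn.

These are two different things, and you are doing the former. What you need to do is something like this: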

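import csv
import xml.etree.ElementTree as ET

# 'nyphil.xml' is a placeholder; use the path to your own XML file
root = ET.parse('nyphil.xml').getroot()

# newline='' keeps the csv module from writing blank lines on Windows
with open('nyphil.txt', 'w', newline='') as f:
    nyphilwriter = csv.writer(f)
    for program in root.iter('program'):
        id_ = program.findtext('id')
        program_id = program.findtext('programID')
        orchestra = program.findtext('orchestra')
        season = program.findtext('season')
        event = None  # stays None if the program has no concertInfo
        for concert in program.findall('concertInfo'):
            event = concert.findtext('eventType')
        for info in program.findall('worksInfo'):
            for work in info.iter('work'):
                work_id = work.get('ID')
                for soloists in work.iter('soloists'):
                    for soloist in soloists.iter('soloist'):
                        # One row per soloist, written inside the innermost loop.
                        # Change this line to whatever you want to write out
                        nyphilwriter.writerow([id_, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])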
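
To see the difference on a small scale, here is a minimal sketch. The SAMPLE string and the variable names are made up for illustration and are not from the question. Reading the loop variable after the loop yields only the last soloist, whereas appending inside the loop keeps every one:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the question's data, for illustration only
SAMPLE = """
<program>
  <worksInfo>
    <work ID="1*">
      <soloists>
        <soloist><soloistName>Otto, Antoinette</soloistName></soloist>
      </soloists>
    </work>
    <work ID="2*">
      <soloists>
        <soloist><soloistName>Horn, Charles Edward</soloistName></soloist>
      </soloists>
    </work>
  </worksInfo>
</program>
"""

root = ET.fromstring(SAMPLE)

# Using the variable after the loop: only the last value survives
for soloist in root.iter('soloist'):
    name = soloist.findtext('soloistName')
last_only = [name]

# Recording inside the loop: one entry per soloist
all_names = []
for soloist in root.iter('soloist'):
    all_names.append(soloist.findtext('soloistName'))

print(last_only)
print(all_names)
```

The same reasoning applies to writing CSV rows: the write has to happen inside the innermost loop, once per soloist.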

13918 does not appear in your data. That aside, here is what I wrote, and it seems to process your data successfully:

from lxml import etree

tree = etree.parse('test.xml')
programs = tree.xpath('.//program')

for program in programs:
    programID, orchestra, season = [
        program.xpath(tag)[0].text for tag in ['programID', 'orchestra', 'season']
    ]
    print(programID, orchestra, season)
    works = program.xpath('worksInfo/work')
    for work in works:
        workID = work.attrib['ID']
        soloistItems = work.xpath('soloists/soloist')
        for soloistItem in soloistItems:
            print(
                workID,
                soloistItem.find('soloistName').text,
                soloistItem.find('soloistInstrument').text,
                soloistItem.find('soloistRoles').text,
            )

This script produces the following output:

3853 New York Philharmonic 1842-43
8834*4 Otto, Antoinette Soprano S
3642* Scharfenberg, William Piano A
3642* Hill, Ureli Corelli Violin A
3642* Derwort, G. H. Viola A
3642* Boucher, Alfred Cello A
3642* Rosier, F. W. Contrabass A
8835*1 Otto, Antoinette Soprano S
8835*1 Horn, Charles Edward Tenor S
8837*6 Horn, Charles Edward Tenor S
8336*4 Otto, Antoinette Soprano S

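One more thing to note: I put an opening tag at the beginning of the XML and a matching closing tag at the end, so that the file has a single root element, because the actual data will contain multiple program elements.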
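
The wrapping step can also be done in code rather than by editing the file. Below is a minimal sketch using the standard library, assuming the data fits in memory. The wrapper tag name programs, the sample string, and the IDs 1 and 2 are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two sibling <program> elements with no enclosing root.
# On its own this is not well-formed XML and would fail to parse.
fragments = (
    "<program><programID>1</programID></program>"
    "<program><programID>2</programID></program>"
)

# 'programs' is an arbitrary wrapper name chosen for this sketch
root = ET.fromstring("<programs>" + fragments + "</programs>")

program_ids = [p.findtext("programID") for p in root.iter("program")]
print(program_ids)
```

If the real file already has a single root element, this step is unnecessary.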

Thank you so much!!! This is exactly what I needed. I'm very new to all of this, and I really was confused about how loops actually work. This is a huge help, thanks!!
If this answer is the one that best solved your problem, you should mark it as accepted by clicking the check mark on the left.
Hi Wayne, I upvoted it, but my reputation is below 15, so it isn't recorded :/ But your answer was very helpful!