Python: Why is XML parsing so difficult?


python xml python-3.x


I am trying to parse this simple document received from EPO-OPS:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

Is there a simple way to do this, similar to working with JSON, such as:

print (root.exchange-documents.exchange-document.abstract.p.text)

This is much easier with BeautifulSoup. Try this:

from bs4 import BeautifulSoup

xml = """<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="19768124" country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks from clay for the brick manufacturing industry, comprising a circulating conveyor (3) carrying mould containers combined to mould container parts (4), a reservoir (5) for clay arranged above the mould containers, means for carrying clay out of the reservoir (5) into the mould containers, means (9) for pressing and trimming clay in the mould containers, means (11) for supplying and placing take-off plates for the green bricks (13) and means for discharging green bricks released from the mould containers, characterized in that the apparatus further comprises means (22) for moving the mould container parts (4) filled with green bricks such that a protruding edge is formed on at least one side of the green bricks. &lt;IMAGE></p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

If you prefer a one-liner:

# 'xml' selects bs4's XML parser, which requires lxml to be installed.
print('\n'.join([i.text for i in BeautifulSoup(xml, 'xml').find_all('abstract')]))

You can use XPath expressions with ElementTree. Note that because the document declares a default XML namespace with xmlns, you need to specify that URL in your queries:

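from xml.etree import ElementTree

tree = ElementTree.parse(…)

# Map a prefix of your choice to the document's default namespace URI.
namespaces = {'ns': 'http://www.epo.org/exchange'}
paragraphs = tree.findall('.//ns:abstract/ns:p', namespaces)
for paragraph in paragraphs:
    print(paragraph.text)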
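For a version that runs without a file on disk, here is a minimal sketch that parses a shortened copy of the document from a string. The names doc and NAMESPACES and the abbreviated abstract text are illustrative assumptions, not part of the original answer. It also shows the same lookup written in ElementTree's {uri}tag form, which needs no prefix mapping.

```python
from xml.etree import ElementTree

# Shortened copy of the EPO-OPS response; the abstract text is abbreviated.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <ops:meta name="elapsed-time" value="2"/>
    <exchange-documents>
        <exchange-document country="EP" doc-number="1000000" kind="A1">
            <abstract lang="en">
                <p>The invention relates to an apparatus (1) for manufacturing green bricks.</p>
            </abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""

# Pass bytes so the parser applies the declared encoding itself.
root = ElementTree.fromstring(doc.encode("utf-8"))

# The 'ns' prefix is arbitrary; only the URI has to match the document.
NAMESPACES = {"ns": "http://www.epo.org/exchange"}
paragraphs = [p.text for p in root.findall(".//ns:abstract/ns:p", NAMESPACES)]
for text in paragraphs:
    print(text)

# Equivalent lookup with the namespace URI written inline in braces.
uri = "http://www.epo.org/exchange"
clark = root.find(".//{%s}abstract/{%s}p" % (uri, uri))
```

Note that the root element itself lives in the ops namespace, while abstract and p fall under the default namespace, which is why the mapping points at http://www.epo.org/exchange.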

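@Scripting.FileSystemObject That's the one. Yes, you can find its documentation here:

Can't we get rid of the namespaces using getroot()?

No. Namespaces are built into the core of ElementTree, and it (correctly) honors them at all times. You can strip the namespaces after parsing, but there is no built-in solution for ignoring them.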
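Stripping the namespaces after parsing, as the last comment suggests, could look roughly like the sketch below. The helper name strip_namespaces and the shortened document are assumptions made for illustration. The helper rewrites each tag from ElementTree's {uri}name form to the bare name, so it is only safe when names cannot collide across namespaces. On Python 3.8 and later, the {*} wildcard in find() and findall() is a lighter alternative that matches a tag in any namespace without modifying the tree.

```python
from xml.etree import ElementTree

# Shortened stand-in for the EPO-OPS response (abstract text abbreviated).
doc = b"""<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
    <exchange-documents>
        <exchange-document country="EP">
            <abstract lang="en"><p>Abbreviated abstract text.</p></abstract>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>"""


def strip_namespaces(element):
    """Hypothetical helper: drop the '{uri}' prefix from every tag, in place."""
    for node in element.iter():
        # Comment and processing-instruction nodes, if present, have a
        # non-string tag, so they are skipped.
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return element


root = ElementTree.fromstring(doc)

# Python 3.8+: '{*}' matches the tag in any namespace; the tree is untouched.
wildcard_text = root.find(".//{*}abstract/{*}p").text

strip_namespaces(root)
# With bare tags, a plain path works, close to the chain asked for in the question.
stripped_text = root.find("exchange-documents/exchange-document/abstract/p").text
print(stripped_text)
```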
