Python: an efficient way to iterate over XML elements


I have XML like this:

<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>

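What I have works, but my files are fairly large, and cProfile tells me that xpath is very expensive to use.

I wonder, is there a more efficient way to iterate over an indefinite number of XML elements?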


XPath should be fast. You can reduce the number of XPath calls to one:

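from lxml import etree

# xml holds the document as a string or bytes; it needs a single root element
doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)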
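
For a simple child path like this one, findall() with an ElementPath expression is another option worth profiling against xpath(). The following is a minimal sketch, not a benchmark. The <root> wrapper and the inline sample data are assumptions, added so that the snippet parses as a single well-formed document. The import falls back to the standard library's xml.etree.ElementTree when lxml is not installed, since both expose findall() with this path syntax.

```python
try:
    from lxml import etree  # preferred, as in the answer above
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same findall() API

# Sample data from the question, wrapped in a hypothetical <root> element
# so that it forms one well-formed document.
xml = b"""<root>
<a><b>hello</b><b>world</b></a>
<x><y></y></x>
<a><b>first</b><b>second</b><b>third</b></a>
</root>"""

doc = etree.fromstring(xml)

# One query for every <b> that is a direct child of an <a>, anywhere in the tree.
b_texts = [b.text for b in doc.findall('.//a/b')]
print(b_texts)
```

Whether this beats a single xpath() call depends on the document, so it is best measured with cProfile on real data.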

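If that is not fast enough, you could try Liza Daly's fast_iter. This has the advantage of not requiring that the entire XML be processed with etree.fromstring first, and parent nodes are discarded after their children have been visited. Both of these help reduce memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed: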

import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

# io.BytesIO needs bytes, so xml must be a bytes object here
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
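
If lxml is not available, a similar prune-as-you-go idea can be sketched with the standard library's xml.etree.ElementTree.iterparse. This is not fast_iter itself: stdlib elements have no getparent() or getprevious(), so this variant keeps a reference to the root and clears it whenever one of its direct children has been fully parsed. The function name iter_texts, the depth counter, and the sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_texts(source, tag):
    """Yield the text of every <tag> element, pruning the tree as parsing proceeds."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0  # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if elem.tag == tag:
            yield elem.text
        if depth == 0:
            # A direct child of the root has just closed, so everything
            # parsed so far can be dropped from the root.
            root.clear()


# Hypothetical sample: the question's data wrapped in a single <root>.
sample = (
    b"<root><a><b>hello</b><b>world</b></a><x><y></y></x>"
    b"<a><b>first</b><b>second</b><b>third</b></a></root>"
)
texts = list(iter_texts(io.BytesIO(sample), "b"))
print(texts)
```

Memory use here is bounded by the largest direct child of the root rather than by the whole file, which is a weaker guarantee than the lxml version above when a single top-level element is itself huge.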

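Liza Daly's article on parsing large XML files (the IBM developerWorks link in the docstring above) may also be useful reading. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse. (See Table 1.)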

Use iterparse:

import lxml.etree as ET

# filelike_object, process_a and process_child are placeholders for your own code
for event, elem in ET.iterparse(filelike_object):
    if elem.tag == "a":
        process_a(elem)
        for child in elem:
            process_child(child)
        elem.clear()  # destroy all child elements
    elif elem.tag != "b":
        elem.clear()

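Note that this does not save all of the memory, but I have been able to work through XML streams of over a gigabyte using this technique.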
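
To make the loop above concrete, here is a small runnable sketch on the question's sample data. The <b> elements are deliberately left alone until their parent <a> closes, so they are still attached when the <a> is handled. The function name group_b_by_a, the <root> wrapper, and the inline sample are assumptions, and the import falls back to the standard library when lxml is missing, since the iterparse() and clear() calls used here exist in both.

```python
import io

try:
    import lxml.etree as ET
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback; same iterparse()/clear() calls


def group_b_by_a(source):
    """Return one list of <b> texts per <a> element, in document order."""
    groups = []
    for event, elem in ET.iterparse(source):  # default: 'end' events only
        if elem.tag == "a":
            # All <b> children are complete and still attached at this point.
            groups.append([child.text for child in elem if child.tag == "b"])
            elem.clear()  # drop the children once they have been read
        elif elem.tag != "b":
            elem.clear()
    return groups


# Hypothetical sample: the question's data wrapped in a single <root>.
sample = (
    b"<root><a><b>hello</b><b>world</b></a><x><y></y></x>"
    b"<a><b>first</b><b>second</b><b>third</b></a></root>"
)
groups = group_b_by_a(io.BytesIO(sample))
print(groups)
```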

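Try import xml.etree.cElementTree as ET

... it ships with Python, and its iterparse is faster than the lxml.etree iterparse, according to:

"For applications that require a high parser throughput of large files, and that do little to no serialization, cET is the best choice. Also for iterparse applications that extract small amounts of data or aggregate information from large XML data sets that do not fit into memory. If it comes to round-trip performance, however, lxml tends to be multiple times faster in total. So, whenever the input documents are not considerably larger than the output, lxml is the clear winner."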
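
On current Python versions the same parser is reached through the plain xml.etree.ElementTree name: the C accelerator has been used automatically since Python 3.3, and the cElementTree alias was removed in Python 3.9. The sketch below illustrates the "aggregate information" case from the quote by counting elements per tag while discarding each element as soon as it has been seen. The function name count_tags and the sample document are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count elements per tag without keeping their content in memory."""
    counts = Counter()
    for _, elem in ET.iterparse(source):  # 'end' events only by default
        counts[elem.tag] += 1
        elem.clear()  # the element is finished, so its content can be dropped
    return counts


# Hypothetical sample: the question's data wrapped in a single <root>.
sample = (
    b"<root><a><b>hello</b><b>world</b></a><x><y></y></x>"
    b"<a><b>first</b><b>second</b><b>third</b></a></root>"
)
counts = count_tags(io.BytesIO(sample))
print(counts["a"], counts["b"])
```

Note that cleared elements stay attached to their parents as empty shells, so for extremely large files the root would still need to be pruned periodically.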
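
bs4 is very useful for this:

from bs4 import BeautifulSoup

# source_file is the path to your XML file; the 'xml' parser requires lxml
with open(source_file, 'r') as raw_xml:
    soup = BeautifulSoup(raw_xml, 'xml')
soup.find_all('tags')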

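What is the purpose of doc = etree.fromstring(xml) in the fast_iter code?
iterparse speed war: as the article says, lxml is faster if you select one specific tag, while for general parsing (where you need to examine multiple tags) cElementTree is faster.
This no longer seems to be up to date: processing a valid, well-formed 10 GB file on different systems with 8 GB of RAM causes Python 3.7.2 to crash the system after reading 7 GB of the file. Neither this solution nor any other based on iterparse() works properly. At first everything is fine, with about 20 MB of RAM in use. Then it trips up and crashes the system.
Please translate "pretty big" into megabytes.
The link is broken; here is a live one: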