XML: 'losing' children when using iterparse


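I would appreciate your help with the following: I need to read a large XML file and convert it to CSV.

I have two functions that are meant to do the same thing. One (function1) uses iterparse, because I need to process files of about 2 GB; the other (function2) does not.

function2 works very well on the same kind of XML file up to about 150 MB; beyond that size it fails because of memory problems.

My problem is that although function1 raises no errors, it loses some children (which is a huge problem!). function2, on the other hand, reads all the children and does not lose or miss any of them.

Question: can you see any reason in the code of function1 why some children would be lost (or read incorrectly, or ignored)?

Note 1: I have prepared a 50 KB XML sample that I can send if needed.
Note 2: the variable "nchil_count" is only used to count the number of children.

Code (function 1):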

def function1 ():
    # This function uses Iterparse
    # Doesn't give errors but loses some children. Why?
    # prints output to csv file, WCEL.csv

    from xml.etree.cElementTree import iterparse

    fname = "C:\Leonardo\Input data\Xml input data\NetactFiles\Netact_3g_rnc11_t1.xml"
    # ELEMENT_LIST = ["WCEL"]

    # Delete contents from exit file
    open("C:\Leonardo\Input data\Xml input data\WCEL.csv", 'w').close()

    # Open exit file
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:

        with open(fname) as xml_doc:
            context = iterparse(xml_doc, events=("start", "end"))
            context = iter(context)
            event, root = context.next()

            for event, elem in context:

                if event == "start" and elem.tag == "{raml20.xsd}managedObject":
                # if event == "start":
                    if elem.get('class') == 'WCEL':
                        print elem.attrib
                        # print elem.tag

                        element = elem.getchildren()
                        nchil_count = 0

                        for child in element:
                            if child.tag == "{raml20.xsd}p":
                                nchil_count = nchil_count + 1
                                # print child.tag
                                # print child.attrib
                                val = child.text
                                # print val
                                val = str (val)
                                exit_file.write(val + ",")

                        exit_file.write('\n')
                        print nchil_count

                elif event == "end" and elem.tag == "{raml20.xsd}managedObject":
                    # Clear Memory
                    root.clear()

    xml_doc.close()
    exit_file.close()

    return ()

Code (function 2):

def function2 (xmlFile):
    # Using Element Tree
    # Successful
    # Works well with files of 150 MB, like an XML (RAML) RNC export from Netact (1 RNC only)
    # It fails with huge files due to Memory

    import xml.etree.cElementTree as etree
    import shutil

    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:

        # Populate the values per cell:

        tree = etree.parse(xmlFile)
        for value in tree.getiterator(tag='{raml20.xsd}managedObject'):
            if value.get('class') == 'WCEL':
                print value.attrib

                element = value.getchildren()
                nchil_count = 0

                for child in element:
                    if child.tag == "{raml20.xsd}p":
                        nchil_count = nchil_count + 1
                        # print child.tag
                        # print child.attrib
                        val = child.text
                        # print val

                        val = str (val)
                        exit_file.write(val + ",")

                exit_file.write('\n')
                print nchil_count

    exit_file.close() ## File closing after writing.

    return ()

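I have a similar problem, but with some important differences:

  • I used lxml.etree rather than xml.etree (the Windows binary build 'lxml-3.4.2-cp34-none-win32.whl')
  • I used iterparse restricted to one specific element, with the "end" event active
  • I then used the xpath() method to dig into that element
The result was the same, though: some nodes were ignored (lost). Nothing in the file explains why. For a given file it is always the same nodes that go missing, but after a purely technical change (reformatting the file with xmllint) a different set of nodes is lost.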

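I reorganized the code (no xpath(), iterparse without the tag argument, both "start" and "end" events, and the flow controlled by the value of element.tag) and found that sometimes (I don't know when) the process "forgets" the default namespace. That is, in most cases the value of element.tag is "{namespace uri}tag_name", but in about 2% of cases it is just "tag_name". That is why xpath() could not find those nodes.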
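To check whether a given file shows this effect, a small diagnostic along the following lines may help. It is only a sketch: it uses the standard library's xml.etree.ElementTree under Python 3 rather than lxml, the function name count_tag_forms is made up for illustration, and the sample document and its "raml20.xsd" namespace are assumptions borrowed from the tags in the question.

```python
import io
from collections import Counter
from xml.etree.ElementTree import iterparse


def count_tag_forms(source):
    """Count 'start' events whose tag is namespace-qualified vs. bare."""
    counts = Counter()
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            # A qualified tag looks like "{namespace uri}tag_name".
            form = "qualified" if elem.tag.startswith("{") else "bare"
            counts[form] += 1
        else:
            # The element is complete at its 'end' event; release its content.
            elem.clear()
    return counts


# Tiny in-memory sample (assumed structure) with a default namespace.
SAMPLE = (
    b'<raml xmlns="raml20.xsd">'
    b'<managedObject class="WCEL"><p name="a">1</p><p name="b">2</p></managedObject>'
    b'</raml>'
)

counts = count_tag_forms(io.BytesIO(SAMPLE))
print(counts["qualified"], counts["bare"])
```

On a file where every element inherits the default namespace, a non-zero "bare" count would point at the behaviour described above.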

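I know that everything in the file belongs to a single default namespace, so I can add "{namespace uri}" myself wherever it is missing and process the file correctly.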
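A minimal sketch of that workaround, assuming Python 3 and that the whole document really does use one known default namespace. The helper name qualify and the constant DEFAULT_NS are hypothetical, and "raml20.xsd" is simply the namespace that appears in the question's code.

```python
# Assumed default namespace of the document (taken from the question's tags).
DEFAULT_NS = "raml20.xsd"


def qualify(tag, namespace=DEFAULT_NS):
    """Return the tag in '{namespace}local' form, adding the namespace if absent."""
    if not isinstance(tag, str):
        # Non-string tags (e.g. comments in lxml) are returned unchanged.
        return tag
    if tag.startswith("{"):
        # Already namespace-qualified.
        return tag
    return "{%s}%s" % (namespace, tag)


print(qualify("managedObject"))
print(qualify("{raml20.xsd}p"))
```

Comparing qualify(elem.tag) instead of elem.tag against the expected name then treats both forms alike. This is only safe when no element in the file is meant to be outside the default namespace.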

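When a namespace prefix is declared explicitly in the main (root) tag and used in all the other tags, the problem does not occur.

This looks like a bug that shows up when parsing large XML files. If the same effect occurs with xml.etree, then it is probably not specific to lxml.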
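Separately from the namespace issue, one thing worth checking in the question's function1 is that it reads the children on the "start" event. The ElementTree documentation says that at a "start" event only the tag and attributes are guaranteed to be available; the children and text may not have been parsed yet, which could explain missing children. The sketch below, which may or may not match the asker's real data, reads them on the "end" event instead. It is written for Python 3 with the standard library, and the name wcel_rows and the sample document are illustrative assumptions.

```python
import csv
import io
from xml.etree.ElementTree import iterparse

NS = "{raml20.xsd}"  # namespace prefix as used in the question's code


def wcel_rows(source):
    """Yield the direct <p> values of each managedObject with class 'WCEL'."""
    for event, elem in iterparse(source, events=("end",)):
        if elem.tag == NS + "managedObject":
            if elem.get("class") == "WCEL":
                # At the 'end' event the element and all its children are complete.
                yield [p.text or "" for p in elem if p.tag == NS + "p"]
            # Release the finished element's children, text and attributes.
            elem.clear()


# Small in-memory sample with an assumed structure.
SAMPLE = b"""<raml xmlns="raml20.xsd"><cmData>
<managedObject class="WCEL" distName="A"><p name="x">1</p><p name="y">2</p></managedObject>
<managedObject class="WBTS" distName="B"><p name="x">9</p></managedObject>
<managedObject class="WCEL" distName="C"><p name="x">3</p><list name="l"><p>7</p></list></managedObject>
</cmData></raml>"""

rows = list(wcel_rows(io.BytesIO(SAMPLE)))

# Write the rows as CSV into an in-memory buffer.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

Calling elem.clear() on each finished managedObject, rather than root.clear(), keeps the cleanup local to the element that was just processed; empty element shells remain attached to the parent, which is usually a small cost.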