XML: using iterparse 'loses' children

I would appreciate your help with the following: I need to read a large XML file and convert it to CSV.

I have two functions that are meant to do the same thing. One (function1) uses iterparse, because I need to process files of about 2 GB; the other (function2) does not.

function2 works very well on the same XML files, but only up to about 150 MB; beyond that size it fails because of memory problems.

My problem is that function1 raises no errors, yet it loses some children (which is a huge problem!). function2, on the other hand, reads all the children and does not lose or miss any of them.

Question: can you see any reason in the code of function1 why some children would be lost (or read incorrectly, or ignored)?

Note 1: I have prepared a 50 KB XML sample that I can send if needed.

Note 2: The variable nchil_count is only used to count the number of children.

Code (function 1):
def function1 ():
    # This function uses Iterparse
    # Doesn't give errors but loses some children. Why?
    # prints output to csv file, WCEL.csv
    from xml.etree.cElementTree import iterparse
    fname = "C:\Leonardo\Input data\Xml input data\NetactFiles\Netact_3g_rnc11_t1.xml"
    # ELEMENT_LIST = ["WCEL"]
    # Delete contents from exit file
    open("C:\Leonardo\Input data\Xml input data\WCEL.csv", 'w').close()
    # Open exit file
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:
        with open(fname) as xml_doc:
            context = iterparse(xml_doc, events=("start", "end"))
            context = iter(context)
            event, root = context.next()
            for event, elem in context:
                if event == "start" and elem.tag == "{raml20.xsd}managedObject":
                    # if event == "start":
                    if elem.get('class') == 'WCEL':
                        print elem.attrib
                        # print elem.tag
                        element = elem.getchildren()
                        nchil_count = 0
                        for child in element:
                            if child.tag == "{raml20.xsd}p":
                                nchil_count = nchil_count + 1
                                # print child.tag
                                # print child.attrib
                                val = child.text
                                # print val
                                val = str (val)
                                exit_file.write(val + ",")
                        exit_file.write('\n')
                        print nchil_count
                elif event == "end" and elem.tag == "{raml20.xsd}managedObject":
                    # Clear Memory
                    root.clear()
        xml_doc.close()
    exit_file.close()
    return ()
Code (function 2):
def function2 (xmlFile):
    # Using Element Tree
    # Successful
    # Works well with files of 150 MB, like an XML (RAML) RNC export from Netact (1 RNC only)
    # It fails with huge files due to Memory
    import xml.etree.cElementTree as etree
    import shutil
    with open("C:\Leonardo\Input data\Xml input data\WCEL.csv", "a") as exit_file:
        # Populate the values per cell:
        tree = etree.parse(xmlFile)
        for value in tree.getiterator(tag='{raml20.xsd}managedObject'):
            if value.get('class') == 'WCEL':
                print value.attrib
                element = value.getchildren()
                nchil_count = 0
                for child in element:
                    if child.tag == "{raml20.xsd}p":
                        nchil_count = nchil_count + 1
                        # print child.tag
                        # print child.attrib
                        val = child.text
                        # print val
                        val = str (val)
                        exit_file.write(val + ",")
                exit_file.write('\n')
                print nchil_count
    exit_file.close() ## File closing after writing.
    return ()
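
One thing worth checking in function1, offered as a hedged note: the ElementTree documentation only guarantees that an element's attributes are available at a "start" event; its text and children may or may not have been parsed yet. Because iterparse reads the file in chunks, a managedObject whose children straddle a chunk boundary could show only some of its p children when they are read at "start". The Python 3 sketch below makes that visible by feeding two hand-picked chunks to XMLPullParser. The helper name, the chunk split, and the un-namespaced sample tags are assumptions for illustration and are not taken from the original files.

```python
import xml.etree.ElementTree as ET


def child_counts_at_start_and_end(chunks, tag):
    """Feed XML text chunk by chunk and return a tuple
    (children seen at the 'start' event, children seen at the 'end' event)
    for the first element whose tag equals `tag`."""
    parser = ET.XMLPullParser(events=("start", "end"))
    at_start = at_end = None
    for chunk in chunks:
        parser.feed(chunk)
        # Only events produced by the data fed so far are available here.
        for event, elem in parser.read_events():
            if elem.tag != tag:
                continue
            if event == "start" and at_start is None:
                at_start = len(elem)  # children parsed so far, maybe not all
            elif event == "end" and at_end is None:
                at_end = len(elem)    # element is complete at this point
    parser.close()
    return at_start, at_end


# Hypothetical sample: the chunk boundary falls between the two <p> children.
CHUNKS = [
    '<root><managedObject class="WCEL"><p>1</p>',
    '<p>2</p></managedObject></root>',
]

if __name__ == "__main__":
    print(child_counts_at_start_and_end(CHUNKS, "managedObject"))
```

If the two counts differ for the real file as well, reading the children in the "end" branch, where the element is complete, would be the usual way to avoid the loss.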
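I had a similar problem, but with some important differences:
- I used lxml.etree rather than xml.etree (the Windows binary build 'lxml-3.4.2-cp34-none-win32.whl' from)
- I used iterparse for one specific element, with the end event active
- I then used the xpath() method to drill down into that element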
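
The approach in the list above can be sketched roughly as follows. This is a minimal Python 3 sketch, not the poster's code: it swaps the standard-library xml.etree.ElementTree in for lxml.etree, filters on the tag by hand, and uses findall() in place of xpath(). The function names wcel_rows and write_csv are hypothetical; the {raml20.xsd} namespace and the WCEL class are taken from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{raml20.xsd}"  # namespace prefix as it appears in the question's tags


def wcel_rows(source):
    """Yield one list of <p> text values per managedObject with class="WCEL".

    `source` may be a file name or a binary file object."""
    # Only "end" events: an element and all its children are complete then.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "managedObject":
            continue
        if elem.get("class") == "WCEL":
            # findall() with a plain tag matches direct children only.
            yield [p.text or "" for p in elem.findall(NS + "p")]
        # Drop the element's content once it has been handled. The emptied
        # element itself stays attached to its parent.
        elem.clear()


def write_csv(source, out):
    """Write one CSV row per WCEL object to `out`; return the row count."""
    writer = csv.writer(out)
    count = 0
    for row in wcel_rows(source):
        writer.writerow(row)
        count += 1
    return count
```

Using csv.writer rather than writing `val + ","` by hand is a design choice of this sketch: it avoids the trailing comma and quotes values that themselves contain commas.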