Parsing a large dataset with lxml in Python


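I'm trying to parse a large XML dataset (2.15 GB) in Python, but I keep running into problems. I first tried parsing it the normal way and got a memory error, so I searched around and ended up with lxml's iterparse(), which is what I'm using at the moment. The problem is that I still get a memory error if I try to parse the whole file. With iterparse I can stop parsing before the memory error occurs (by adding a counter), but I want to parse the entire file. Here is my code: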

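from lxml import etree

# `file` holds the path of the input file (set earlier, not shown)
all_sentences = []
arguments = file.split(".")

if arguments[-1] == "xml":
    # Stream the document; each <sentence> element is delivered once it is complete
    tree = etree.iterparse(file, tag="sentence")

    counter = 0

    for event, elem in tree:
        counter += 1
        was_sentence = False
        manipulated_sentence = []
        child_index = 0

        for child in elem:
            if child.tag == "w":
                was_sentence = True
                child_index += 1

                head_rel = str(child.get("deprel"))
                if head_rel == "ROOT":
                    head = 0
                else:
                    try:
                        head = int(str(child.get("dephead")))
                    except ValueError:
                        # Missing or non-numeric dephead: stop reading this sentence
                        break

                manipulated_sentence.append([child_index, str(child.text), str(child.get("pos")), head])

        if was_sentence:
            # Every parsed sentence is kept in memory for the whole run
            all_sentences.append(manipulated_sentence.copy())

        # Temporary workaround: stop before the memory error occurs
        if counter > 44000:
            break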

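How is this used downstream? It looks like you are building one huge list, and if you have memory problems, the natural thing to do is to turn the code into a generator.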
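The generator suggestion could look roughly like the sketch below. It is an illustration under stated assumptions, not a confirmed fix for this dataset. The function name `iter_sentences` is made up for this example. The `elem.clear()` call and the removal of already-processed siblings are not in the question's code; they follow a commonly used lxml pattern, because `iterparse` otherwise keeps the whole tree it has built so far in memory. Unlike the original loop, this version skips sentences that end up with no usable words.

```python
import io
from lxml import etree


def iter_sentences(source):
    """Yield one sentence at a time instead of collecting them all in a list.

    `source` may be a file path or a binary file-like object.
    The name `iter_sentences` is hypothetical, chosen for this sketch.
    """
    for _event, elem in etree.iterparse(source, events=("end",), tag="sentence"):
        sentence = []
        # Number the <w> children from 1, as child_index does in the question
        for index, child in enumerate(elem.iterchildren("w"), start=1):
            if child.get("deprel") == "ROOT":
                head = 0
            else:
                try:
                    head = int(child.get("dephead"))
                except (TypeError, ValueError):
                    # Missing or non-numeric dephead: stop reading this sentence
                    break
            sentence.append([index, str(child.text), str(child.get("pos")), head])

        # Assumption: the element is not needed after this point, so drop its
        # contents and remove earlier siblings that iterparse still holds on to.
        elem.clear()
        parent = elem.getparent()
        if parent is not None:
            while elem.getprevious() is not None:
                del parent[0]

        if sentence:
            yield sentence


# Small in-memory document standing in for the real 2.15 GB file
sample = io.BytesIO(
    b"<corpus>"
    b"<sentence><w deprel='ROOT' pos='VB'>Go</w>"
    b"<w deprel='ADV' dephead='1' pos='AB'>now</w></sentence>"
    b"<sentence><w deprel='ROOT' pos='NN'>Stop</w></sentence>"
    b"</corpus>"
)
for sentence in iter_sentences(sample):
    print(sentence)
```

This only helps if the downstream code handles each sentence as it arrives (for example by writing it to disk or updating counts). Calling `list()` on the generator rebuilds the huge list and brings the memory problem back.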