
Python XML ElementTree parsing a large document returns only a subset

I have a large XML document of German text elements, but root.iter() returns only a subset of the document.

root.iter('tu') finds only 82 elements.

import logging
import xml.etree.cElementTree as ET
class Extractor(object):
    def _get_iter(self, filename: str):
        with open(filename) as objects:
            context = ET.iterparse(objects, events=("start", "end"))

            index, (event, root) = next(enumerate(context))

            return root.iter('tu')

    def get_objects(self, filename: str, limit=-1):
        found = sum(1 for _ in self._get_iter(filename))
        logging.getLogger(__name__).info('found: {}'.format(found))

# found is 82; the actual number is in the millions

alignments = extractor.get_alignments('data/file.tmx', 100000)
Update: sample TMX file:

Update: I solved this by checking the event and the tag name (tu) while iterating. I think this is incorrect behavior on the part of root.iter().

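The behavior of root.iter('tagname') is puzzling here: it does not act as an iterator over the whole document. Apparently the document is still being built when it is called. iterparse constructs the tree incrementally as it reads the file, so after a single next() call the root contains only the elements parsed so far.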
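The effect can be reproduced with a minimal sketch. The in-memory document, its size, and the element names below are made up for illustration; the point is only that the root obtained from the first event is incomplete until the iterparse iterator has been exhausted.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document with 5000 <tu> elements (about 120 KB),
# large enough that the parser does not consume it in one read.
total = 5000
xml_bytes = ("<body>" + "<tu><seg>text</seg></tu>" * total + "</body>").encode("utf-8")

context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))

# Pull only the first event, as in the question; its element is the root.
event, root = next(context)

# At this point the root holds only the elements parsed so far.
partial = sum(1 for _ in root.iter("tu"))

# Exhausting the iterator finishes parsing and completes the tree.
for _ in context:
    pass
complete = sum(1 for _ in root.iter("tu"))

print(partial, complete)
```

The first number printed should be well below the second, which mirrors the 82-versus-millions discrepancy in the question.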

The solution is:

import xml.etree.ElementTree as ET

class Extractor(object):
    def get_objects(self, filename: str):

        # get an iterable of (event, element) pairs
        context = ET.iterparse(filename, events=("start", "end"))

        # turn it into an iterator
        context = iter(context)

        for event, elem in context:
            # an element is complete only once its "end" event arrives
            if event == "end" and elem.tag == "tu":
                # do something with elem
                elem.clear()  # frees the element's contents after the data has been used
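As a hedged sketch of how the "do something with elem" step might look, the loop can be wrapped in a generator that yields the segment texts of each finished tu element. The function name iter_segments and the sample data are hypothetical; the nesting tu > tuv > seg is assumed from the TMX format, not taken from the file in the question.

```python
import io
import xml.etree.ElementTree as ET

def iter_segments(source, tag="tu"):
    """Yield the <seg> texts of every completed `tag` element (hypothetical helper)."""
    # Only "end" events are needed: an element is complete when its end tag is seen.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield [seg.text for seg in elem.iter("seg")]
            # Free the element's contents once the caller has consumed them.
            elem.clear()

# Made-up sample data standing in for a real TMX file.
sample = b"""<tmx><body>
<tu><tuv><seg>Hallo</seg></tuv><tuv><seg>Hello</seg></tuv></tu>
<tu><tuv><seg>Welt</seg></tuv><tuv><seg>World</seg></tuv></tu>
</body></tmx>"""

pairs = list(iter_segments(io.BytesIO(sample)))
print(pairs)
```

Because the generator yields one translation unit at a time, a caller can stop early (for example after a limit) without parsing the rest of the file.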


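I think you are missing some code here. _get_iter() seems to be a method of a class.

I can share a subset, but I don't know how to share the file.

Please see the latest update to the post; I have added a pastebin link to a file. It is TMX, which is XML.

If you have solved the problem, please post an answer.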