Python XML ElementTree parses a large document but returns only a subset
I have a very large XML document of German text, and iterating from the root returns only a subset of it: root.iter('tu') finds only 82 elements.
import logging
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9


class Extractor(object):
    def _get_iter(self, filename: str):
        with open(filename) as objects:
            context = ET.iterparse(objects, events=("start", "end"))
            # the first event is the start of the root element
            index, (event, root) = next(enumerate(context))
            return root.iter('tu')

    def get_objects(self, filename: str, limit=-1):
        found = sum(1 for _ in self._get_iter(filename))
        logging.getLogger(__name__).info('found: {}'.format(found))
        # found is 82, actual number is millions


extractor = Extractor()
alignments = extractor.get_objects('data/file.tmx', 100000)
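Update: sample TMX file:
Update: I solved this by checking the event and the tag name ('tu') while iterating. I think this is incorrect behavior on the part of root.iter(). The behavior of root.iter('tagname') is puzzling: it does not work like the iterator I expected, and it apparently only covers the part of the document that has been prepared so far.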
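A likely explanation for the count of 82 is that iterparse builds the tree incrementally, so a root obtained from the first event only contains the elements parsed up to that point. The sketch below illustrates this on an in-memory document. The helper name count_early_and_late and the sample data are hypothetical, and how much is parsed before the first event depends on the parser's internal read-buffer size, which is an implementation detail.

```python
import io
import xml.etree.ElementTree as ET


def count_early_and_late(xml_bytes: bytes, tag: str):
    """Count `tag` elements under the root right after the first
    iterparse event, and again after the whole input is consumed."""
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    _, root = next(context)  # first event: start of the root element

    # Only the part of the document parsed so far is attached to root.
    early = sum(1 for _ in root.iter(tag))

    # Drain the remaining events so the parser reads the rest of the input.
    for _ in context:
        pass

    # Now the tree is complete.
    late = sum(1 for _ in root.iter(tag))
    return early, late


if __name__ == "__main__":
    # Hypothetical sample: a document much larger than one read buffer.
    data = b"<body>" + b"<tu><seg>x</seg></tu>" * 50000 + b"</body>"
    early, late = count_early_and_late(data, "tu")
    print("after first event:", early, "after full parse:", late)
```

In the question's code the file is also closed as soon as _get_iter() returns, so no further input can be parsed and root.iter('tu') can only walk what was already built.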
The solution is:
import xml.etree.ElementTree as ET


class Extractor(object):
    def get_objects(self, filename: str):
        # get an iterable
        context = ET.iterparse(filename, events=("start", "end"))
        # turn it into an iterator
        context = iter(context)
        for event, elem in context:
            if event == "end" and elem.tag == "tu":
                # do something with elem
                elem.clear()  # clears memory after doing something with the data
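The same streaming approach can be written as a generator, which makes counting or limiting the results straightforward. This is a minimal sketch, not the original poster's code: the function name iter_tu_texts is hypothetical, and it assumes the usual TMX layout in which each tu element holds its text in seg elements.

```python
import io
import xml.etree.ElementTree as ET


def iter_tu_texts(source, tag="tu"):
    """Yield the list of <seg> texts for each completed `tag` element.

    `source` may be a filename or a binary file object.
    Assumes TMX-style content (tu > tuv > seg); adjust for other formats.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            texts = [seg.text for seg in elem.iter("seg")]
            # Drop the element's children and text once processed. The empty
            # element itself stays attached to its parent, so memory still
            # grows slowly on very large files.
            elem.clear()
            yield texts


if __name__ == "__main__":
    # Hypothetical two-unit TMX fragment used only for illustration.
    sample = (
        b'<tmx version="1.4"><header/><body>'
        b'<tu><tuv xml:lang="de"><seg>Hallo</seg></tuv>'
        b'<tuv xml:lang="en"><seg>Hello</seg></tuv></tu>'
        b'<tu><tuv xml:lang="de"><seg>Welt</seg></tuv></tu>'
        b'</body></tmx>'
    )
    for texts in iter_tu_texts(io.BytesIO(sample)):
        print(texts)
```

Because nothing is read until the generator is consumed, sum(1 for _ in iter_tu_texts(filename)) counts every tu in the file, not just those in the first parsed chunk.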
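I think you are missing some code here. _get_iter() appears to be a method of a class.
I can share a subset, but I don't know how to share the file.
Please see the latest update to the post. I have a pastebin link to a file; it is TMX, which is XML.
If you have solved the problem, please post an answer.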