Python: is there a way in `lxml` to skip irrelevant branches while parsing?


python xml parsing


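I am trying to parse a small amount of data out of a large XML file (2 GB).

For small files I simply use

root_node.xpath('/the/specific/path/i/am/interested/in')

but I have read that this is very memory-intensive.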

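For large files, the usual advice is to implement a target parser or to use `etree.iterparse`. These approaches use less memory, but they still iterate over the whole tree.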
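For reference, the `etree.iterparse` approach mentioned above might look roughly like the sketch below. The helper name `find_texts` and the sample document are made up for illustration. The loop tracks the current element path on a stack and clears every element once it ends, so memory stays bounded, but it still receives an event for every element in the file.

```python
import io
from lxml import etree


def find_texts(source, path):
    """Yield the text of every element whose absolute path equals `path`.

    Hypothetical helper: every element is still visited, but each one is
    cleared as soon as it ends, so finished subtrees do not pile up in memory.
    """
    wanted = path.strip("/").split("/")
    stack = []  # tag names from the root down to the current element
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == wanted:
                yield elem.text
            stack.pop()
            elem.clear()  # drop children, text and attributes of the finished element


# Small stand-in for the 2 GB file from the question.
SAMPLE = b"""<the>
  <irrelevant_to_me><junk>skip</junk></irrelevant_to_me>
  <specific><path><i><am><interested>
    <irrelevant_to_me>skip</irrelevant_to_me>
    <in>goal</in>
  </interested></am></i></path></specific>
</the>"""

for text in find_texts(io.BytesIO(SAMPLE), "/the/specific/path/i/am/interested/in"):
    print(text)
```

Cleared elements stay attached to their parent as empty shells until the parent itself ends, so this bounds memory rather than eliminating it.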

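Can `lxml` also be used to skip iteration over all irrelevant branches, i.e. to keep the parser from entering them at all?

For example, when I have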

<the>
  <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
  <specific>
    <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
    <path>
      <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
      <i>
         <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
         <am>
           <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
           <interested>
             <irrelevant_to_me> 3000 lines, do not enter! </irrelevant_to_me>
             <in>
               <!-- goal -->
             </in>
           </interested>
         </am>
      </i>
    </path>
  </specific>
</the>



the parser should not even enter the `<irrelevant_to_me>` nodes (regardless of their names).

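Did you actually run into memory problems when you tried it, or were you just put off by the warnings? You also did not say whether you really need to modify that huge file; if not, switching from lxml to SAX would be an option.

@guidot I have not verified it, but I am fairly sure. Independent of the actual memory load, I have a didactic interest in a more adaptive way of using the parser. Because of dependency constraints in the actual project, I strongly prefer an `lxml`-based solution.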
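One way to stay within `lxml` while getting SAX-like behaviour is the target parser interface (`etree.XMLParser(target=...)`), which reports `start`, `end` and `data` events without building a tree. The sketch below is only an illustration under that assumption: the class name `PathTarget` and its attributes are invented here. As far as I know, the underlying parser still has to read through an irrelevant subtree to find its closing tag, so the branches are not skipped at the byte level; the target merely does as little as possible for them by counting depth.

```python
from lxml import etree


class PathTarget:
    """Hypothetical parser target that collects the text of one absolute path.

    Any element that leaves the wanted path puts the target into skip mode,
    where it only counts nesting depth until that subtree is closed.
    """

    def __init__(self, path):
        self.wanted = path.strip("/").split("/")
        self.stack = []       # matched prefix of the wanted path
        self.skip_depth = 0   # > 0 while inside an irrelevant subtree
        self.buffer = None    # text chunks of the goal element, if inside one
        self.results = []

    def start(self, tag, attrib):
        if self.skip_depth:
            self.skip_depth += 1
            return
        depth = len(self.stack)
        if depth < len(self.wanted) and tag == self.wanted[depth]:
            self.stack.append(tag)
            if self.stack == self.wanted:
                self.buffer = []
        else:
            self.skip_depth = 1  # off the path: ignore this whole subtree

    def end(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.stack == self.wanted:
            self.results.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()

    def data(self, data):
        if self.buffer is not None and not self.skip_depth:
            self.buffer.append(data)

    def close(self):
        # Called when parsing finishes; its return value is the parse result.
        results, self.results = self.results, []
        self.stack = []
        return results


# Small stand-in document; a real 2 GB file would have to be streamed instead.
SAMPLE = b"""<the>
  <irrelevant_to_me><the><specific/></the>skip</irrelevant_to_me>
  <specific><path><i><am><interested>
    <irrelevant_to_me>skip</irrelevant_to_me>
    <in>goal</in>
  </interested></am></i></path></specific>
</the>"""

parser = etree.XMLParser(target=PathTarget("/the/specific/path/i/am/interested/in"))
print(etree.fromstring(SAMPLE, parser))
```

Because no `Element` objects are created for the skipped branches, the cost per irrelevant node is reduced to a couple of Python method calls; whether that is fast enough for a 2 GB file would need to be measured.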