Python: get the whole parent tag's text in ElementTree


While using xml.etree.ElementTree as ET, I want to get the whole text inside an XML tag that contains some child nodes. Consider the following XML:

<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>


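Suppose the XML above is in node. Then node.text gives me only "This is the start of parent tag...". However, I want to capture all the text inside the p tag, including the text of its child tags, which would result in:

This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2

Is there a way to do this? I went through the documentation but did not find anything that really works.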

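This is indeed a very awkward property of ElementTree. The gist is: if an element contains both text and child elements, and the children are interleaved with runs of text, then the text that follows a child element is stored as that child's tail, not as the parent's text.

To collect all the text that sits directly inside an element or anywhere in its descendants, you need to read both text and tail of the element's children and of all their descendants, in addition to the element's own text.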

>>> from lxml import etree

>>> s = '<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>'

>>> root = etree.fromstring(s)
>>> child1, child2 = list(root)  # getchildren() is deprecated; iterate the element instead

>>> root.text
'This is the start of parent tag...'

>>> child1.text, child1.tail
('child 1', '. blah1 blah1 blah1 ')

>>> child2.text, child2.tail
('child2', ' blah2 blah2 blah2 ')
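The text/tail rule shown in the session above can be turned into a small recursive helper. This is only a sketch: collect_text is a hypothetical name, not part of lxml or ElementTree, and it uses the standard-library xml.etree.ElementTree rather than lxml. It deliberately skips the tail of the element it is called on, because that text belongs to the element's parent.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Return all text inside `element`: its own text, plus each
    descendant's text and tail, in document order.

    Hypothetical helper for illustration; not a library function.
    """
    parts = [element.text or ""]          # text before the first child
    for child in element:
        parts.append(collect_text(child))  # everything inside the child
        parts.append(child.tail or "")     # text right after the child
    return "".join(parts)


s = ('<p>This is the start of parent tag...'
     '<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 '
     '<ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>')
root = ET.fromstring(s)
full_text = collect_text(root)
print(full_text)
```

For this input the helper should produce the same string as "".join(root.itertext()), so in practice the built-in iterator is usually the simpler choice.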

You can do something similar with ElementTree:

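import xml.etree.ElementTree as ET

# The XML from the question
data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>"""
tree = ET.fromstring(data)
# itertext() yields every text and tail inside the element, in document order
print(' '.join(tree.itertext()).strip())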

Output:

This is the start of parent tag...
         child 1 . blah1 blah1 blah1  child2  blah2 blah2 blah2
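The output above still carries the newline and indentation from the source XML, and joining with a space puts a gap before the period. If the single-line form from the question is wanted, one possible approach is to join the pieces without a separator and then collapse runs of whitespace. This is a sketch, not the answerer's method, and normalized_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""


def normalized_text(element):
    """Join all text under `element` and collapse whitespace runs.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    raw = "".join(element.itertext())  # keep the original spacing between pieces
    return " ".join(raw.split())       # collapse newlines/indentation to one space


node = ET.fromstring(data)
result = normalized_text(node)
print(result)
```

Collapsing whitespace this way also changes spacing inside the text itself, so it only suits content where runs of whitespace carry no meaning.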

Right, I didn't even notice that this was about xml.etree :-) +1.
