XML parsing in Python: how to get the string indexes of child nodes relative to the flattened string

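I am new to XML parsing in Python and need to extract some data about the inner text of certain phrase nodes and their children (preferably with Minidom, but that is not a requirement).

For example: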

<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar> 
    be useful</foo>.
</phrase>

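What I would like to get is the following data:

- The whole text of the parent node and its children, combined into a single string (like the recursive `getText` method in the Minidom documentation)
- A list of triples containing the child data:
    - tag name
    - start index within the whole string
    - end index within the whole string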

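In the XML example, the inner text of <bar> (might) starts at index 14 and ends at index 18, while the content of <foo> (might be useful) starts at index 19 and ends at index 28. Running this example should return something like the following (the order of the children does not matter):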

("This example might be useful.", [('bar', 14, 18), ('foo', 19, 28)])
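
As a quick sanity check on the numbering, the sketch below looks up both fragments in the flattened string with `str.find`. It assumes 0-based offsets with an exclusive end (Python's slicing convention); under that assumption `might` begins at 13 rather than 14. The helper name `span_of` is illustrative only.

```python
flat = "This example might be useful."

def span_of(fragment, text):
    """Return (start, end) of the first occurrence of fragment; end is exclusive."""
    start = text.find(fragment)
    if start == -1:
        raise ValueError(f"{fragment!r} not found")
    return start, start + len(fragment)

bar_span = span_of("might", flat)
foo_span = span_of("be useful", flat)
print(bar_span, foo_span)  # (13, 18) (19, 28)
```

With this convention, `flat[13:18]` gives back `might` and `flat[19:28]` gives back `be useful`.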

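This is an interesting project! It is somewhat convoluted, and I am not sure how far it will carry over to other cases, but try the following: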

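from lxml import etree

# The XML from the question
phrase = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar>
    be useful</foo>.
</phrase>"""
doc = etree.fromstring(phrase)

# This requires a couple of helper functions to clean up spaces, find indexes, etc.:

def space_rem(text):
    # Collapse runs of spaces into a single space
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text

def build(tag):
    # Join the text nodes that sit directly under <tag>, then locate them in ttxt
    parts = doc.xpath(f'//{tag}/text()')
    own_text = ''
    for s in parts:
        own_text += s.strip()
    own_text = space_rem(own_text)
    start = ttxt.find(own_text)
    return start, start + len(own_text)

foo_lst = ['foo']
bar_lst = ['bar']
ttxt = ''

# Build the flattened text from every text node, in document order
for t in doc.xpath('//*/text()'):
    ttxt += t.replace('\n', '')
ttxt = space_rem(ttxt)

foo_lst.extend(build('foo'))
bar_lst.extend(build('bar'))

print((ttxt, foo_lst, bar_lst))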

Output:

('This example might be useful.', ['foo', 19, 28], ['bar', 13, 18])
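
Because `str.find` returns the first match, the lookup above can point at the wrong place when the same words occur more than once in the phrase. One possible alternative, sketched here with the standard-library `xml.dom.minidom`, records the offsets while walking the tree instead of searching afterwards. The function name `flatten_with_spans` is hypothetical. The sketch assumes 0-based offsets with an exclusive end, whitespace runs collapsed to a single space, and a span that covers the text sitting directly under each element (which is what makes `foo` come out as 19 to 28).

```python
import re
from xml.dom import Node, minidom

XML = """<phrase id="x.y">This example
    <foo id="x.y.z">
        <bar type="likelihood" ref="x.y.z">might</bar>
    be useful</foo>.
</phrase>"""

def flatten_with_spans(xml_source):
    """Return (flat_text, [(tag, start, end), ...]) for every non-root element."""
    root = minidom.parseString(xml_source).documentElement
    pieces = []            # normalized output, joined at the end
    length = 0             # length of the output so far
    pending_space = False  # whitespace seen since the last word
    entries = []           # one [tag, start, end] per element, document order

    def emit(text, entry):
        nonlocal length, pending_space
        for token in re.findall(r"\s+|\S+", text):
            if token.isspace():
                pending_space = True
                continue
            if pending_space and length:
                pieces.append(" ")  # a single space between words
                length += 1
            pending_space = False
            if entry[1] is None:
                entry[1] = length   # first word owned by this element
            pieces.append(token)
            length += len(token)
            entry[2] = length       # end of the latest word owned by it

    def walk(node, entry):
        for child in node.childNodes:
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
                emit(child.data, entry)
            elif child.nodeType == Node.ELEMENT_NODE:
                child_entry = [child.tagName, None, None]
                entries.append(child_entry)
                walk(child, child_entry)

    walk(root, [root.tagName, None, None])
    # Elements without any direct text are left out of the result
    spans = [tuple(e) for e in entries if e[1] is not None]
    return "".join(pieces), spans

flat, spans = flatten_with_spans(XML)
print((flat, spans))
# ('This example might be useful.', [('foo', 19, 28), ('bar', 13, 18)])
```

Note that if an element has direct text both before and after a child, its span in this sketch runs from the first of those words to the last and therefore also covers the child's text.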