Removing non-tag content from XML in Python

python regex xml

I want to remove everything that is not inside an XML tag (a cleanup), and optionally put the result into a list. I have some XML like this:

<tag>some text</tag> unwanted text <tag>some text</tag>

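I just want to expand on what has already been answered here, because I think the right approach is not to use regex on XML-like content. You should use an XML parser. The unwanted content is called the tail of an element, and you can clear it while walking the parsed tree. Here is one way: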

import xml.etree.ElementTree as ET

s = '''<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'''

tree = ET.fromstring(s)

cleaned_tree = []

for node in tree:
    node.tail = ''  # discard the text that follows this element's closing tag
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3
    cleaned_tree.append(ET.tostring(node, encoding='unicode'))

print(cleaned_tree)
# ['<tag>some text</tag>', '<tag>some text</tag>']

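As a side note: if you look at your str(cleanup), you will see that it lacks a root tag like the one in my example. Since re.findall returns a list, str(cleanup) is also the list's repr rather than XML. The fact that fromstring() fails on it may indicate a problem with your XML source.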
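The missing root tag mentioned in the side note can be worked around by wrapping the fragment before parsing. The following is a minimal sketch, assuming the input is the two-sibling fragment from the question; the wrapper name `root` and the variable names are illustrative choices, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Fragment from the question: two sibling elements and no single root element.
fragment = '<tag>some text</tag> unwanted text <tag>some text</tag>'

# Parsing the bare fragment raises ParseError because XML requires one root.
try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError:
    parsed_bare = False

# Assumed workaround: wrap the fragment in a throwaway root element.
root = ET.fromstring('<root>' + fragment + '</root>')

cleaned = []
for node in root:
    node.tail = None  # drop the text that follows the closing tag
    cleaned.append(ET.tostring(node, encoding='unicode'))

print(parsed_bare)
print(cleaned)
```

Wrapping only helps when the fragment is otherwise well-formed; unbalanced tags or unescaped `&` characters would still make the parser fail.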

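Yes, I would, but I am getting an error. I have updated my question.

Ah, OK, I see what you want.

@fuubah, glad it helped. If your nodes contain any nested tags, you will need to iterate over them and handle them properly, but I believe the tail attribute is what you are looking for.

OK, the only change I made was print(cleaned_tree) for Python 3.

@fuubah, cool, I never asked which version of Python you were using :) I have updated the answer accordingly.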
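The comment about nested tags can be illustrated with a short sketch. It assumes that every tail counts as unwanted, at any depth, which would also strip text that follows an inline child in mixed content. The helper name `strip_tails` and the sample input are made up for illustration. The removed text is collected in a list, in line with the "optionally put it into a list" part of the question.

```python
import xml.etree.ElementTree as ET

def strip_tails(root):
    """Clear the tail of every element under root and return the removed text."""
    removed = []
    # iter() walks the whole tree in document order, root included.
    for node in root.iter():
        if node.tail and node.tail.strip():
            removed.append(node.tail.strip())
        node.tail = None
    return removed

# Hypothetical input with unwanted text at two nesting levels.
s = '<root><group><tag>a</tag> junk <tag>b</tag></group> more junk </root>'
root = ET.fromstring(s)

removed = strip_tails(root)
result = ET.tostring(root, encoding='unicode')

print(removed)
print(result)
```

Because elements are visited in document order, the tail of `<group>` is recorded before the tail of the `<tag>` nested inside it.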

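The attempt from the question, for reference:

import re
import xml.etree.ElementTree as ET

cleanup = re.findall(r"^<.>.*</.>$", input)  # returns a list of matches, not a string
root = ET.fromstring(str(cleanup))  # str() of a list is not XML, so parsing fails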
