Python: How do I delete certain nodes from XML?


I have an XML file with the following content:

    <node1>
      bla
      <remove>
        abc
      </remove>
        kkk
    </node1>

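I need to remove the <remove> node under node1, but other nodes also contain <remove>, and those must not be touched. I would like to know how to do this, perhaps with an awk script, Python, or something similar.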

The output should be:

   <node1>
      bla
        abc
        kkk
    </node1>

With the following input:

$ cat file
<node1>
   bla
   <remove>
     abc
   </remove>
   kkk
</node1>
<node9>
   bla
   <remove>
     abc
   </remove>
   kkk
</node9>
$ awk '/<node1>/{gsub(/<[/]?remove>/," ")}
       {printf "%s%s",$0,RT}' RS='</node[0-9]+>' file | grep '\S'
<node1>
   bla
     abc
   kkk
</node1>
<node9>
   bla
   <remove>
     abc
   </remove>
   kkk
</node9> 
The script even works when the tags are not on separate lines:

$ cat file
<node1>bla<remove>abc</remove>kkk</node1>
<node9>bla<remove>abc</remove>kkk</node9>

$ awk '/<node1>/{gsub(/<[/]?remove>/," ")}
       {printf "%s%s",$0,RT}' RS='</node[0-9]+>' file 
<node1>bla abc kkk</node1>
<node9>bla<remove>abc</remove>kkk</node9>
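
For readers without gawk, the same idea can be sketched in Python with the standard `re` module: find each `<node1>...</node1>` block, then replace the `<remove>` and `</remove>` tags with a space inside that block only. The function name `strip_remove_tags` and its parameters are made up for this illustration, and the sketch assumes the tags carry no attributes and that `<node1>` blocks are not nested.

```python
import re

def strip_remove_tags(text, parent="node1", tag="remove"):
    """Replace <tag> and </tag> with a space, but only inside <parent>...</parent>."""
    # Non-greedy match of one <parent> block; DOTALL lets it span several lines.
    block = re.compile(r"<{0}>.*?</{0}>".format(re.escape(parent)), re.DOTALL)
    # Matches both the opening and the closing tag (no attributes assumed).
    tags = re.compile(r"</?{0}>".format(re.escape(tag)))
    return block.sub(lambda m: tags.sub(" ", m.group(0)), text)

sample = (
    "<node1>bla<remove>abc</remove>kkk</node1>\n"
    "<node9>bla<remove>abc</remove>kkk</node9>\n"
)
print(strip_remove_tags(sample), end="")
```

Like the awk version, this is plain text processing and inherits the same limits: it does not understand comments, CDATA sections, or attributes.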

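You should be aware that modifying XML with text-processing tools is risky. If you must do it, this sed one-liner should work for your example and for the example in sudo's answer: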

sed '/node1>/,/node1>/{/remove>/d}' file

Another awk:

awk '/node1>/,/\/node1>/ {if ($0~/remove>/) $0=""} NF'

I would suggest using an XML parser. A good example is BeautifulSoup:

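from bs4 import BeautifulSoup
import sys

# The 'xml' parser requires lxml to be installed.
with open(sys.argv[1]) as fh:
    soup = BeautifulSoup(fh, 'xml')

# find_all() returns a list, so the tree can be modified safely inside the loop;
# recursive=False restricts the search to direct children of <node1>.
for elem in soup.node1.find_all('remove', recursive=False):
    # unwrap() drops the tag but keeps its text, as in the expected output;
    # decompose() would delete the text as well.
    elem.unwrap()

print(soup)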
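
If installing a third-party package is not an option, roughly the same result can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration under a few assumptions: the helper `unwrap_children` is a made-up name, ElementTree has no built-in unwrap so the text and tail handling is done by hand, and because the sample has several top-level elements it is wrapped in a synthetic `<root>` before parsing.

```python
import xml.etree.ElementTree as ET

def unwrap_children(parent, tag):
    """Remove each direct child <tag> of parent, keeping its text and children."""
    for child in list(parent):
        if child.tag != tag:
            continue
        index = list(parent).index(child)
        grandchildren = list(child)
        head = child.text or ""
        tail = child.tail or ""
        if grandchildren:
            # Text after the removed tag follows its last surviving child.
            grandchildren[-1].tail = (grandchildren[-1].tail or "") + tail
        else:
            head += tail
        # Text that was inside the removed tag joins whatever precedes it.
        if index == 0:
            parent.text = (parent.text or "") + head
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + head
        parent.remove(child)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(index + offset, grandchild)

sample = (
    "<node1>bla <remove>abc</remove> kkk</node1>"
    "<node9>bla <remove>abc</remove> kkk</node9>"
)
# Wrap in a synthetic root because the sample is not a single-rooted document.
root = ET.fromstring("<root>" + sample + "</root>")
for node in root.findall("node1"):
    unwrap_children(node, "remove")
result = "\n".join(ET.tostring(child, encoding="unicode") for child in root)
print(result)
```

Only `<node1>` elements are passed to the helper, so the `<remove>` element under `<node9>` is left alone.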

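+1, this is a nice approach if the tags are on separate lines as described. You should point out that this only works with gawk, because RS is more than one character.
@Jotne Yes, I originally meant to; I must have been distracted.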