Python: strip some tags and rename them


Using the lxml library, and given the doc.xml file below, I want to strip some tags and rename them. doc.xml:

<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>
Does anyone have any ideas on how to do this?

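  • Create a new document containing only a <p> tag.
  • Iterate over the descendants of the <body> tag in the original document.
    • Add the tags from the original document to the new document, as descendants of its <p> tag.
      • If an <h5> tag is encountered, add an <h5> tag to the <p> tag
        • and add the subsequent tags to it (the <h5> tag) as descendants.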
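
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `regroup` and the sample `DOC` string are hypothetical, the `<p>` and `<h5 title="...">` output shape is taken from the result shown in this thread, and the fallback to the standard-library `xml.etree.ElementTree` is only there so the sketch also runs where lxml is not installed. It groups the `<div>` elements under their heading but does not attempt to strip any `<span>` tags.

```python
import copy

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same calls for this sketch
    import xml.etree.ElementTree as etree

# Hypothetical sample input, shortened from doc.xml in the question.
DOC = """<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
    </body>
</html>"""


def regroup(html_root):
    """Build a new <p> tree in which each <h5> wraps the elements that follow it."""
    body = html_root.find("body")
    new_root = etree.Element("p")
    current = None  # the <h5> element currently collecting children
    for child in body:
        if child.tag == "h5":
            # Start a new group; the heading text becomes the title attribute.
            current = etree.SubElement(new_root, "h5")
            current.set("title", (child.text or "").strip())
        else:
            # Copy rather than move, so the original tree is left untouched.
            target = current if current is not None else new_root
            target.append(copy.deepcopy(child))
    return new_root


result = regroup(etree.fromstring(DOC))
print(etree.tostring(result, encoding="unicode"))
```

Elements that appear before the first `<h5>` are attached directly to `<p>` here; that is a design choice of this sketch, not something the thread specifies.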

    • Here is an XSLT solution using lxml. It offloads the processing to libxml2/libxslt. I have added comments to the transformation stylesheet:

      from lxml import etree
      
      xsl = etree.XML('''
      <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="xml" indent="yes" />
          <xsl:strip-space elements="*"/>
      
          <xsl:template match="/">
              <p>
                  <xsl:apply-templates select="html/body"/>
              </p>
          </xsl:template>
      
          <!-- match body, but do not add content; this excludes /html/body elements -->
          <xsl:template match="body">
              <xsl:apply-templates />
          </xsl:template>
      
          <xsl:template match="h5">
              <!-- record the current h5 title -->
              <xsl:variable name="title" select="."/>
              <h5>
                  <xsl:attribute name="title">
                      <xsl:value-of select="$title" />
                  </xsl:attribute>
      
                  <xsl:for-each select="following-sibling::div[preceding-sibling::h5[1] = $title]">
                      <!-- deep copy of each consecutive div following the current h5 element -->
                      <xsl:copy-of select="." />
                  </xsl:for-each>
              </h5>
          </xsl:template>
      
          <!-- match div, but do not output anything since we are copying it into the new h5 element -->
          <xsl:template match="div" />
      </xsl:stylesheet>
      ''')
      
      transform = etree.XSLT(xsl)
      with open("doc.xml") as f:
          print(transform(etree.parse(f)), end='')
      
      Result:

      <?xml version="1.0"?>
      <p>
        <h5 title="Fruits">
          <div>This is some <span attr="foo">Text</span>.</div>
          <div>Some <span>more</span> text.</div>
        </h5>
        <h5 title="Vegetables">
          <div>Yet another line <span attr="bar">of</span> text.</div>
          <div>This span will get <span attr="foo">removed</span> as well.</div>
          <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
          <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
        </h5>
      </p>
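
The grouping in the stylesheet depends on the XPath predicate `following-sibling::div[preceding-sibling::h5[1] = $title]`. As a rough sketch of what that expression selects, the same predicate can be evaluated directly from Python with lxml's `xpath()` method, passing the heading text as the XPath variable `$title`. The sample `DOC` string and the `groups` dictionary are hypothetical names used only for illustration.

```python
from lxml import etree

# Hypothetical sample input, shortened from doc.xml in the question.
DOC = """<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
    </body>
</html>"""

root = etree.fromstring(DOC)

# For each heading, count the following <div> siblings whose nearest
# preceding <h5> has the same string value as that heading.
groups = {}
for h5 in root.iterfind("body/h5"):
    divs = h5.xpath(
        "following-sibling::div[preceding-sibling::h5[1] = $title]",
        title=h5.text,
    )
    groups[h5.text] = len(divs)

print(groups)
```

Because the predicate compares headings by their text, two `<h5>` elements with identical text would, as far as I can tell, end up sharing each other's `<div>` elements; the sample document avoids this because its headings are distinct.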
      
      
      
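      Welcome to SO. Please take some time to read the page and the other links on it. This is not a discussion forum or tutorial service.
      @wwii I forgot to ask the question.
      Could you provide a small example?
      Since you have not provided any code you are having trouble with and are asking how to do something, I assume you are looking for an algorithm.
      @Andrew - does this answer your question?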
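      Assuming the stylesheet is saved as doc.xsl, the same transformation can also be run from the command line:

      xsltproc doc.xsl doc.xml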
      