Python: strip some tags and rename them

Using the lxml library, I have this file, doc.xml, and I want to strip some tags and rename them:

<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

Does anyone have any ideas on how to do this?

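Create a new document containing only a <p> tag.
Iterate over the descendants of the <body> tag in the original document.
- Add each tag from the original document to the new document, as a descendant of its <p> tag.
  - If an <h5> tag is encountered, add the <h5> tag to the <p> tag,
    - and add the tags that follow it as its children.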
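
The steps above can be sketched directly in Python with lxml. This is a minimal sketch, not a definitive implementation: the helper name group_under_headings is hypothetical, the input is a shortened copy of doc.xml inlined as a string, and it assumes the headings and divs are direct children of the body tag, so it walks those children and not every descendant.

```python
import copy

from lxml import etree

# Shortened copy of doc.xml from the question, inlined to keep the sketch self-contained.
DOC = """<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
    </body>
</html>"""


def group_under_headings(source_root):
    """Build a new <p> tree in which each <h5> holds the tags that followed it.

    Hypothetical helper name; it is not part of lxml.
    """
    new_root = etree.Element("p")
    current = new_root  # parent for the next copied tag
    for child in source_root.find("body"):
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        if child.tag == "h5":
            # Start a new group and keep the heading text as a title attribute.
            title = "".join(child.itertext()).strip()
            current = etree.SubElement(new_root, "h5", title=title)
        else:
            # Deep-copy so the original document is left untouched.
            current.append(copy.deepcopy(child))
    return new_root


root = etree.fromstring(DOC)
result = group_under_headings(root)
print(etree.tostring(result, encoding="unicode"))
```

Any tag that appears before the first h5 is attached to the p tag itself, which matches the first bullet of the outline.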

Alternatively, the grouping can be done with an XSLT stylesheet applied through lxml:

from lxml import etree

xsl = etree.XML('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
        <p>
            <xsl:apply-templates select="html/body"/>
        </p>
    </xsl:template>

    <!-- match body, but do not add content; this excludes /html/body elements -->
    <xsl:template match="body">
        <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="h5">
        <!-- record the current h5 title and its node identity -->
        <xsl:variable name="title" select="."/>
        <xsl:variable name="id" select="generate-id()"/>
        <h5>
            <xsl:attribute name="title">
                <xsl:value-of select="$title" />
            </xsl:attribute>

            <!-- compare by node identity, not by text, so two h5 elements with the same title do not claim each other's divs -->
            <xsl:for-each select="following-sibling::div[generate-id(preceding-sibling::h5[1]) = $id]">
                <!-- deep copy of each consecutive div following the current h5 element -->
                <xsl:copy-of select="." />
            </xsl:for-each>
        </h5>
    </xsl:template>

    <!-- match div, but do not output anything since we are copying it into the new h5 element -->
    <xsl:template match="div" />
</xsl:stylesheet>
''')

transform = etree.XSLT(xsl)
with open("doc.xml") as f:
    print(transform(etree.parse(f)), end='')

Output:

<?xml version="1.0"?>
<p>
  <h5 title="Fruits">
    <div>This is some <span attr="foo">Text</span>.</div>
    <div>Some <span>more</span> text.</div>
  </h5>
  <h5 title="Vegetables">
    <div>Yet another line <span attr="bar">of</span> text.</div>
    <div>This span will get <span attr="foo">removed</span> as well.</div>
    <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
    <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
  </h5>
</p>



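The stylesheet can also be run from the command line with xsltproc, assuming it is saved on its own as doc.xsl; it produces the same output:

xsltproc doc.xsl doc.xml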
