How to split an element that contains mixed content? (XML, XSLT 2.0)


xml xslt


This is the structure of my XML document:

<root>
    <txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>



After processing, it should look like this:

<root>
    <txt>text here </txt>
    <url>http</url>
    <txt> <b>may</b> occur <i>many<sup>TM</sup></i> times.</txt>
</root>



(Line breaks added manually for clarity.)

The following template gets it "almost" right, except for the parts I commented out, which are of course not allowed:

<xsl:template match="txt/text()[contains(.,'http')]">
    <xsl:variable name="here" select="." />

    <xsl:analyze-string select="." regex="htt[^ ]+">

        <xsl:matching-substring>
            <!-- this would solve all problems:
                 let's just close the txt-element for a second ...
                <xsl:text></txt></xsl:text>
            -->
            <xsl:element name="url">
                <xsl:attribute name="href" select="." />
                <xsl:value-of select="."/>
            </xsl:element>
            <!-- ... and open the txt-element again: nice!
                <xsl:text><txt></xsl:text>
            -->
        </xsl:matching-substring>

        <xsl:non-matching-substring>
            <txt> <!-- not needed for the fake -->
                <xsl:copy-of select="."/>
            </txt> <!-- ditto -->
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

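Instead, I use additional templates that wrap every other part of txt in a txt element of its own, as shown below. The result is valid, but not really usable: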

<xsl:template match="txt">
    <!-- only needed for the fake solution above:
            <xsl:copy> -->
        <xsl:apply-templates />
    <!-- </xsl:copy> -->
</xsl:template>

<xsl:template match="txt/text()[not(contains(.,'http'))]">
    <txt>
        <xsl:copy-of select="." />
    </txt>
</xsl:template>

<xsl:template match="txt/*" name="element_wrapper">
    <txt>
        <xsl:copy>
            <xsl:apply-templates />
        </xsl:copy>
    </txt>
</xsl:template>

The result is ugly, but valid:

<root>
    <txt>text here </txt>
    <url>http</url>
    <txt> </txt>
    <txt><b>may</b></txt>
    <txt> occur </txt>
    <txt><i>many<sup>TM</sup></i></txt>
    <txt> times.</txt>
</root>



(Again, I added the line breaks.)

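All the other "solutions" I have seen split at element boundaries or merely tokenize strings, but none of them splits in the middle of a text. Maybe my working solution could be reformatted by removing all adjacent txt element boundaries, but I have no idea how to achieve this.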

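I suggest two steps, using modes: first apply xsl:analyze-string to the text nodes to turn http (or its variants) into url elements, then use xsl:for-each-group with group-starting-with="url" to do the splitting: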
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:template match="@* | node()" mode="#all">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="txt">
        <xsl:variable name="links">
            <xsl:copy>
                <xsl:apply-templates mode="insert-links"/>
            </xsl:copy>
        </xsl:variable>
        <xsl:apply-templates select="$links/node()" mode="extract-urls"/>
    </xsl:template>

    <xsl:template match="text()" mode="insert-links" priority="5">
        <xsl:analyze-string select="." regex="http[s]?">
            <xsl:matching-substring>
                <url href="{.}">
                    <xsl:value-of select="."/>
                </url>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

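    <!-- Split any element that now has url children: each url becomes a
         sibling, and the nodes around it are re-wrapped in copies of the parent. -->
    <xsl:template match="*[url]" mode="extract-urls">
        <xsl:for-each-group select="node()" group-starting-with="url">
            <xsl:choose>
                <xsl:when test="self::url">
                    <!-- Emit the url itself, then wrap whatever follows it. -->
                    <xsl:copy-of select="."/>
                    <xsl:element name="{name(..)}">
                        <xsl:apply-templates select="current-group() except ." mode="extract-urls"/>
                    </xsl:element>
                </xsl:when>
                <xsl:otherwise>
                    <!-- Leading group before the first url. -->
                    <xsl:element name="{name(..)}">
                        <xsl:apply-templates select="current-group()" mode="extract-urls"/>
                    </xsl:element>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each-group>
    </xsl:template>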

</xsl:stylesheet>



This transforms the input
<txt>text here http <b>may</b> occur <i>many<sup>TM</sup></i> times and https as well.</txt>


into the output
<txt>text here </txt><url href="http">http</url><txt> <b>may</b> occur <i>many<sup>TM</sup></i> times and </txt><url href="https">https</url><txt> as well.</txt>

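As for "my working solution could be reformatted by removing all adjacent txt element boundaries, but I have no idea how to achieve this": you can of course store the transformation result in a variable and then simply identify adjacent txt elements and merge them.

You have given one input document and the output for that document, but you have not stated any general rule for what the output should be for different inputs. The only way we know that you want to recognize "http" and "https" but not "ftp" is by looking at your code, but we know your code is incorrect, so we cannot be expected to reverse engineer your requirements by reading it.

My question was more about splitting the txt element. The detected pattern is arbitrary. Martin Honnen understood this correctly. In general, I am trying to split the txt element at some position within its text content. At that position, a different "intermittent" element needs to be added.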
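The merge step suggested in the answer above (identify adjacent txt elements and merge them) could be sketched roughly as follows. This is a minimal sketch and not part of the original answer. It assumes that the fragmented txt and url elements are direct children of root, that root has element-only content, and that xsl:for-each-group with group-adjacent is an acceptable way to detect runs of txt siblings. It is written as a standalone second pass over the fragmented result. Within a single stylesheet, the same template could instead be applied, in a separate mode, to a variable holding the output of the first pass.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- Identity template: copy everything that is not handled below. -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Assumption: root holds only elements (txt, url, ...), so selecting
         "*" ignores any indentation whitespace between them. -->
    <xsl:template match="root">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <!-- Group consecutive children by whether they are txt elements. -->
            <xsl:for-each-group select="*" group-adjacent="boolean(self::txt)">
                <xsl:choose>
                    <xsl:when test="current-grouping-key()">
                        <!-- A run of adjacent txt elements: emit one txt
                             holding the content of all of them. -->
                        <txt>
                            <xsl:apply-templates select="current-group()/node()"/>
                        </txt>
                    </xsl:when>
                    <xsl:otherwise>
                        <!-- Anything else (for example url) passes through. -->
                        <xsl:apply-templates select="current-group()"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

Applied to the fragmented result shown in the question, this should leave one txt element before the url and one txt element after it, which is the structure the question asks for. Attributes on the individual txt fragments are dropped by this sketch, so it would need adjusting if txt can carry attributes.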