Selecting an n-word excerpt that includes HTML markup using XSLT

html xslt parsing


I want to select an excerpt, together with its HTML formatting elements, using XSLT. Here is a sample of the XML:

<PUBLDES>The <IT>European Journal of Cancer (including EJC Supplements),</IT> 
is an international comprehensive oncology journal that publishes original 
research, editorial comments, review articles and news on experimental oncology, 
clinical oncology (medical, paediatric, radiation, surgical), translational 
oncology, and on cancer epidemiology and prevention. The Journal now has online
submission for authors. Please submit manuscripts at 
<SURL>http://ees.elsevier.com/ejc</SURL> and follow the instructions on the 
site.<P/>

The <IT>European Journal of Cancer (including EJC Supplements)</IT> is the 
official Journal of the European Organisation for Research and Treatment 
of Cancer (EORTC), the European CanCer Organisation (ECCO), the European 
Association for Cancer Research (EACR), the the European Society of Breast 
Cancer Specialists (EUSOMA) and the European School of Oncology (ESO). <P/>
Supplements to the <IT>European Journal of Cancer</IT> are published under 
the title <IT>EJC Supplements</IT> (ISSN 1359-6349).  All subscribers to 
<IT>European Journal of Cancer</IT> automatically receive this publication.<P/>
To access the latest tables of contents, abstracts and full-text articles 
from <IT>EJC</IT>, including Articles-in-Press, please visit <URL>
<HREF>http://www.sciencedirect.com/science/journal/09598049</HREF>
<HTXT>ScienceDirect</HTXT>
</URL>.</PUBLDES>


How can I take the first 45 words from this, along with the HTML tags inside them? When I use substring() or concat(), the tags (such as <IT>) are stripped.

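You would probably be better off doing this programmatically rather than in pure XSLT, but if you must use XSLT, here is one way. It involves multiple stylesheets, although if you are able to use extension functions you could use node-set and combine them into one big (and nasty) stylesheet.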
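As a rough illustration of the node-set idea, the skeleton below runs two passes inside one stylesheet. It is only a sketch under assumptions: it relies on the EXSLT exsl:node-set() extension (available in processors such as libxslt and Xalan; MSXML offers msxsl:node-set() instead), and the mode names pass1 and pass2 are hypothetical. The two identity templates are placeholders where the tokenizing and truncating templates would go.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common"
                exclude-result-prefixes="exsl" version="1.0">
   <xsl:template match="/">
      <!-- Pass 1: capture the intermediate result as a result tree fragment -->
      <xsl:variable name="pass1">
         <xsl:apply-templates select="node()" mode="pass1"/>
      </xsl:variable>
      <!-- Pass 2: convert the fragment to a node-set and process it again -->
      <xsl:apply-templates select="exsl:node-set($pass1)/node()" mode="pass2"/>
   </xsl:template>
   <!-- Placeholder for pass 1 (the tokenizing templates would go here) -->
   <xsl:template match="@*|node()" mode="pass1">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" mode="pass1"/>
      </xsl:copy>
   </xsl:template>
   <!-- Placeholder for pass 2 (the word-limiting templates would go here) -->
   <xsl:template match="@*|node()" mode="pass2">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()" mode="pass2"/>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>
```

Keeping each pass in its own mode stops the templates of one pass from matching nodes that belong to another.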

The first stylesheet copies the initial XML but "tokenizes" any text it finds, so that each word in the text becomes a separate WORD element.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Copy existing nodes and attributes -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Match text nodes -->
   <xsl:template match="text()">
      <xsl:call-template name="tokenize">
         <!-- Normalize first so that line breaks also separate words -->
         <xsl:with-param name="string" select="normalize-space(.)"/>
      </xsl:call-template>
   </xsl:template>
   <!-- Splits a string into separate elements for each word -->
   <xsl:template name="tokenize">
      <xsl:param name="string"/>
      <xsl:param name="delimiter" select="' '"/>
      <xsl:choose>
         <xsl:when test="$delimiter and contains($string, $delimiter)">
            <xsl:variable name="word" select="normalize-space(substring-before($string, $delimiter))"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
            <xsl:call-template name="tokenize">
               <xsl:with-param name="string" select="substring-after($string, $delimiter)"/>
               <xsl:with-param name="delimiter" select="$delimiter"/>
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="word" select="normalize-space($string)"/>
            <xsl:if test="string-length($word) &gt; 0">
               <WORD>
                  <xsl:value-of select="$word"/>
               </WORD>
            </xsl:if>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

The XSLT template used to tokenize the text strings was taken from another question on this site.

(Note that XSLT 2.0 has a built-in tokenize() function, which would simplify the above.)
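For comparison, here is a minimal sketch of what the tokenizing step might look like with tokenize(). It assumes an XSLT 2.0 processor (Saxon, for example) and reuses the WORD element name from the stylesheet above; it has not been run against the sample document.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
   <!-- Copy existing nodes and attributes -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Wrap each whitespace-separated word of a text node in a WORD element -->
   <xsl:template match="text()">
      <!-- normalize-space() collapses line breaks and runs of spaces to single spaces -->
      <xsl:for-each select="tokenize(normalize-space(.), ' ')[. != '']">
         <WORD>
            <xsl:value-of select="."/>
         </WORD>
      </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>
```

The recursive named template is no longer needed, because tokenize() returns the words as a sequence that xsl:for-each can iterate over directly.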

This gives you XML like this:

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
      ....



And so on.

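Next, traverse this XML document with another XSLT document that outputs only the elements for the first 45 words. To do this, I apply a template recursively, keeping a running total of the words found so far. When a node is matched, there are three possibilities:

A WORD element is matched: output it. If the total has not been reached, continue processing at the next sibling.
An element is matched whose descendant word count fits within the remaining total: copy the whole element and, if the total has not been reached, continue processing at the next sibling.
An element is matched whose descendant word count would exceed the total: copy the current node (but not its children) and continue processing at its first child.

Here is the stylesheet, in all its ugliness: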

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Number of words to output -->
   <xsl:variable name="WORDCOUNT">6</xsl:variable>

   <!-- Start at the first element, with no words counted yet -->
   <xsl:template match="/">
      <xsl:apply-templates select="descendant::*[1]" mode="word">
         <xsl:with-param name="previousWords">0</xsl:with-param>
      </xsl:apply-templates>
   </xsl:template>

   <xsl:template match="node()" mode="word">
      <xsl:param name="previousWords"/>
      <!-- Number of words contained in the current element -->
      <xsl:variable name="childWords" select="count(descendant::WORD)"/>
      <xsl:choose>
         <!-- Case 1: a WORD element; output it and move on to the next sibling -->
         <xsl:when test="local-name(.) = 'WORD'">
            <WORD>
               <xsl:value-of select="."/>
            </WORD>
            <xsl:if test="$previousWords + 1 &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + 1"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:if>
         </xsl:when>
         <!-- Case 2: all words of this element fit; copy it whole -->
         <xsl:when test="$childWords &lt;= $WORDCOUNT - $previousWords">
            <xsl:copy>
               <xsl:copy-of select="*|@*"/>
            </xsl:copy>
            <xsl:if test="$previousWords + $childWords &lt; $WORDCOUNT">
               <xsl:apply-templates select="following-sibling::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords + $childWords"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:if>
         </xsl:when>
         <!-- Case 3: too many words; copy only the element and descend into it -->
         <xsl:otherwise>
            <xsl:copy>
               <xsl:apply-templates select="descendant::*[1]" mode="word">
                  <xsl:with-param name="previousWords">
                     <xsl:value-of select="$previousWords"/>
                  </xsl:with-param>
               </xsl:apply-templates>
            </xsl:copy>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
</xsl:stylesheet>

If you were outputting only the first 4 words, say, this would give you the following output:

<PUBLDES>
   <WORD>The</WORD>
   <IT>
      <WORD>European</WORD>
      <WORD>Journal</WORD>
      <WORD>of</WORD>
   </IT>
</PUBLDES>

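Of course, you would then need yet another transformation to remove the WORD elements and leave just the text. This should be fairly straightforward.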
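That final clean-up pass could look something like the sketch below. It is only one possible approach, and it makes an assumption the answer does not state: that a single space between consecutive words is an acceptable approximation of the original spacing. Because the tokenizing step splits text at element boundaries, punctuation that directly followed a closing tag in the source will end up preceded by a space.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <!-- Ignore indentation-only text nodes in the intermediate document -->
   <xsl:strip-space elements="*"/>
   <!-- Copy everything that is not a WORD element unchanged -->
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
   <!-- Replace each WORD element with its text -->
   <xsl:template match="WORD">
      <!-- Emit a separating space before every word except the very first -->
      <xsl:if test="preceding::WORD">
         <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="."/>
   </xsl:template>
</xsl:stylesheet>
```

The more specific WORD template takes priority over the identity template, so the formatting elements such as IT pass through untouched while the WORD wrappers disappear.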

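It is all pretty nasty, but it is the best I can come up with for the moment.

Which language are you using?
@zapping: XSLT is already mentioned in the title!
Are you able to use XSLT 2.0, or only XSLT 1.0?
What is the output format: text, XML, or HTML?
If markup is counted as words, selecting an arbitrary length may produce XML output that is not well-formed.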