Xml 使用XPath查找连续同级_Xml_Xpath_Nokogiri

Xml 使用XPath查找连续同级

xml xpath

Xml 使用XPath查找连续同级,xml,xpath,nokogiri,Xml,Xpath,Nokogiri,对于XPath专家来说，这里有一个简单的要点！：）文件结构： <tokens> <token> <word>Newt</word><entityType>PROPER_NOUN</entityType> </token> <token> <word>Gingrich</word><entityType>PROPER_NOUN</e

对于XPath专家来说，这里有一个简单的要点！：）

文件结构：

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

。。。它可以找到两个连续的专有名词标记中的第二个，但我不知道如何让它发出第一个标记

一些注意事项：

我不介意对节点集进行更高级别的处理（例如在Ruby/Nokogiri中），如果这样可以简化问题的话
如果有三个或更多连续的专有名词标记（称它们为A、B、C），理想情况下我希望发出[A、B]、[B、C]

更新下面是我使用高级Ruby函数的解决方案。但是我已经厌倦了所有那些XPath恶霸在我脸上踢沙子，我想知道真正的XPath程序员是如何做到这一点的

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence| 
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end

def提取（doc）
名称=[]
语句=doc.xpath（//标记）
每句话
tokens=句子.xpath（“token”）
上一个=零
代币。每个do |代币|
name=token.xpath（“word”）.text if-token.xpath（“entityType”）.text==“专有名词”
name我将分两步完成这项工作。第一步是选择一组节点：
//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

这将为您提供启动2字对的所有标记。然后，为了获得实际的对，在节点列表上迭代并提取/word
和以下兄弟：：令牌[1]/word

使用XmlStarlet（用于快速xml操作的工具）命令行
xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml 

给予
Newt,Gingrich
Garry,Trudeau

XmlStarlet还将把该命令行编译为xslt，相关位为
  <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
    <xsl:value-of select="word"/>
    <xsl:value-of select="','"/>
    <xsl:value-of select="following-sibling::token[1]/word"/>
    <xsl:value-of select="'&#10;'"/>
  </xsl:for-each>



使用Nokogiri，它可能看起来像：
#解析文档
doc=Nokogiri:：XML（文档字符串）
#选择启动2字对的所有标记
pair\u start=doc.xpath'//token[entityType=“property\u noon”和以下同级：：token[1][entityType=“property\u noon”]'
#提取每个单词和下面的单词
结果=配对开始。每个带有_对象（[]）的|节点、数组|
数组XPath返回节点或节点集，但不返回组。所以你必须确定每个小组的开始，然后抓住剩下的
first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
next = "../following-sibling::token[1]/word"

doc.xpath(first).map{|word| [word.text, word.xpath(next).text] }

输出：
[["Newt", "Gingrich"], ["Garry", "Trudeau"]]

此XPath 1.0表达式：
   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

选择所有“成对第一名词词”
此XPath表达式：
   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

选择所有“第二对名词词”
您必须以生成的两个结果节点集的第k个节点生成实际对
基于XSLT的验证：
   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

及
当对该XML文档应用相同的转换时（注意，我们现在有三个示例）：
及
注意事项：
   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

可以使用单个XPath 2.0表达式生成所需的结果。如果您对XPath 2.0解决方案感兴趣，请务必告诉我。
仅XPath一项功能不足以完成此任务。但是XSLT非常简单：
<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key="PROPER_NOUN">
     <xsl:copy-of select="current-group">
     <xsl:text>====</xsl:text>
  <xsl:if>
</xsl:for-each-group>


我得出的答案与您的答案几乎相同，因此我在这个答案中添加了一个使用Nokogiri的示例，而不是添加另一个答案。