R: Parsing XML based on the attributes and text values of related nodes

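I have used the XML package before to parse HTML and XML, and I have a rudimentary understanding of XPath. However, I have been asked to work with XML data where the important bits are determined by a combination of an element's own text and attributes and those of related nodes. I have never done that before. For example: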

[Updated example, slightly expanded]

<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>
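If another bookstore's location contained "NY", it would contribute a second row, and so on.

Am I asking too much of the parser with conditions this complex?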

require(XML)

xdata <- xmlParse(apptext)
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
#[[1]]
#<author>Jane Smith</author> 

#[[2]]
#<author>John Doe</author> 

#[[3]]
#<author>Karl Pearson</author> 

#[[4]]
#<author>William Gosset</author> 
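The XPath breaks down as follows. Find the location nodes whose text contains "NY":

//*/location[text()[contains(.,"NY")]]
Get the books sibling of each of these nodes:

/following-sibling::books
From those nodes, get all authors that have no ref attribute:

/.//author[not(@ref)]
If you want the text rather than the nodes, use xmlValue: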

> xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
[1] "Jane Smith"     "John Doe"       "Karl Pearson"   "William Gosset"
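
To see what each step of the expression selects, it can help to run the pieces one at a time. The following is only a sketch: the miniature catalogue, the store ids S1 and S2, and the author names are made up for illustration and are not the data from the question.

```r
library(XML)

# Hypothetical miniature catalogue, invented for this illustration
txt <- '<Catalogue>
  <Bookstore id="S1">
    <location>NY downtown</location>
    <books>
      <book><author ref="1">Ann</author><author>Bob</author></book>
    </books>
  </Bookstore>
  <Bookstore id="S2">
    <location>LA</location>
    <books>
      <book><author>Cid</author></book>
    </books>
  </Bookstore>
</Catalogue>'
doc <- xmlParse(txt, asText = TRUE)

# Step 1: location elements whose text contains "NY"
step1 <- '//*/location[text()[contains(.,"NY")]]'
loc <- getNodeSet(doc, step1)

# Step 2: the <books> element that follows each of those locations
step2 <- paste0(step1, '/following-sibling::books')
books <- getNodeSet(doc, step2)

# Step 3: authors anywhere below those <books> that have no ref attribute
step3 <- paste0(step2, '/.//author[not(@ref)]')
authors <- xpathSApply(doc, step3, xmlValue)
authors
```

Building the path up with paste0 is just a convenience for inspecting intermediate results; the single combined string used in the answer selects the same nodes.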
Update, to retrieve the id of the bookstore that each matching author belongs to:

child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
    xpathSApply(x,'.//ancestor::bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
# id  id  id  id 
#"1" "1" "1" "1" 
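
One thing to watch when adapting the snippet above to other data: XPath matches element names case-sensitively, so ancestor::bookstore finds nothing in a document whose element is spelled Bookstore. A small sketch, using a made-up one-store document (the id S1 and the author name are hypothetical):

```r
library(XML)

# Hypothetical one-store document; note the capital B in <Bookstore>
doc <- xmlParse(
  '<Catalogue><Bookstore id="S1"><books><book><author>Ann</author></book></books></Bookstore></Catalogue>',
  asText = TRUE
)
author <- getNodeSet(doc, '//author')[[1]]

# Lower-case name: no element matches, so the result is empty
lower <- xpathSApply(author, 'ancestor::bookstore/@id')

# Exact-case name: the enclosing store's id is returned
exact <- xpathSApply(author, 'ancestor::Bookstore/@id')

length(lower)
exact
```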

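Oh man, that's great, and I have adapted it to my data. I get a vector of author names, which I can then concatenate. The problem is that I only need to concatenate the ones associated with the same bookstore ID. I thought I could walk the XPath back up to ancestor::bookstore and return its @id into another vector. But doing that returns one ID per bookstore rather than one ID per qualifying book. I was expecting the latter, which would give a vector of the same length as the vector of author names. Any suggestions?

It's a bit hard to comment without a concrete example. You can take each node that satisfies your condition back to its ancestor (the bookstore) and retrieve the id. I have added an example that may well work with your full data; the question may be a little out of sync with the new data, so I hope that isn't the sticking point.

Fantastic! Handling the discrepancy I introduced is icing on the cake. Thanks a lot!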
With the updated example data:

xdata <- '<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>'
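library(XML)
# Parse the XML string into an internal document
xdata <- xmlParse(xdata)
# Authors without a ref attribute, in bookstores whose location contains "foo"
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]')

# For a given author node, return the id of its enclosing Bookstore
ans.func<-function(x){
  xpathSApply(x,'.//ancestor::Bookstore[@id]/@id')
}

# One id per matching author, in document order
sapply(child.nodes,ans.func)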
#           id            id            id            id            id 
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051" 
#           id            id 
#"ID910700051" "ID910700051"

xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
# [1] "Mark"    "Luke"    "Duey"    "Louie"   "Dopey"   "Bashful" "Doc"
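
Since the two vectors above have the same length and order, they can be paired up and the authors concatenated per bookstore, which is what the comments ask for. The following is a minimal sketch of one way to do that, not the answerer's own code: the miniature document, the store ids S1 to S3, and the names res and by_store are assumptions made for illustration.

```r
library(XML)

# Hypothetical miniature catalogue with the same shape as the example data
txt <- '<Catalogue>
  <Bookstore id="S1">
    <location>foo bar</location>
    <books>
      <book><author ref="1">Matthew</author><author>Mark</author><author>Luke</author></book>
    </books>
  </Bookstore>
  <Bookstore id="S2">
    <location>foo</location>
    <books>
      <book><author>Dopey</author><author ref="1">Grumpy</author></book>
    </books>
  </Bookstore>
  <Bookstore id="S3">
    <location>bar</location>
    <books>
      <book><author>Nod</author></book>
    </books>
  </Bookstore>
</Catalogue>'
doc <- xmlParse(txt, asText = TRUE)

# Authors without a ref attribute in stores whose location contains "foo"
path <- '//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]'
nodes <- getNodeSet(doc, path)

# One row per matching author: the enclosing store id and the author text
res <- data.frame(
  store  = unname(sapply(nodes, function(n) xpathSApply(n, 'ancestor::Bookstore/@id'))),
  author = sapply(nodes, xmlValue),
  stringsAsFactors = FALSE
)

# Concatenate the author names within each store
by_store <- tapply(res$author, res$store, paste, collapse = ", ")
by_store
```

Stores with no qualifying author (S3 here) simply do not appear in the result, because the rows are built from the matching author nodes rather than from the list of stores.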