R: How to parse a sub-path of an XML file with the xml2 package

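I have an XML file (its URL is in the code below) and I need to parse it with the xml2 package. However, with this code I cannot get the list of nodes under the subcellularLocation XPath: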

library(xml2)
library(magrittr)  # provides %>%, which xml2 does not export
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"

doc <- xmlfile %>%
  xml2::read_xml()

xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)

If you don't mind, you can use the rvest package:

library(rvest)

# read_html() uses the HTML parser, so the tag name is written in lowercase here
a <- read_html(xmlfile) %>%
  html_nodes("subcellularlocation")

a %>% html_children() %>% html_text()

[1] "Nucleus"                                              "Chromosome"                                          
[3] "Cytoplasm"                                            "Secreted"                                            
[5] "Cell membrane"                                        "Peripheral membrane protein"                         
[7] "Extracellular side"                                   "Endosome"                                            
[9] "Endoplasmic reticulum-Golgi intermediate compartment"

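Thank you very much. How can I get the contents of the child nodes (that is, the locations) under the subcellularlocation node, for example Nucleus, Chromosome, Cytoplasm, and so on?

@scamander What output do you want? Could you update your question with the expected result? A small example should give an idea of what you expect.

@Scamander, does this solution still not answer your question?

It is not quite there (for example, Cell membrane and the others should be separate).

See my answer (using xml2 only).

If an answer solved your problem, please click the "accept" mark so that others know the question is resolved.
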
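Using xml2 only: the document declares a default namespace, so the element names in the XPath need a namespace prefix. The xml2 documentation has this example:

# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need to use xml_ns to form
# a unique mapping between full namespace url and a short prefix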
x <- read_xml('
 <root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
   <f:doc><g:baz /></f:doc>
   <f:doc><g:baz /></f:doc>
 </root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))

# For the UniProt document, xml_ns() maps the default namespace to the prefix d1
xml_ns(doc)
## d1  <-> http://uniprot.org/uniprot
## xsi <-> http://www.w3.org/2001/XMLSchema-instance
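
To see offline why the unprefixed XPath finds nothing, here is a minimal sketch. The `toy` document is a hypothetical, heavily simplified stand-in for the UniProt file, not the real data, and the variable names are made up for illustration. It compares the failing query with three options: the `d1` prefix reported by `xml_ns()`, a namespace-agnostic `local-name()` test in XPath, and `xml_ns_strip()`, which removes the default namespace from the document in place.

```r
library(xml2)

# Hypothetical stand-in for the UniProt XML (assumption, not the real file)
toy <- read_xml('
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <comment type="subcellular location">
      <subcellularLocation>
        <location>Nucleus</location>
      </subcellularLocation>
      <subcellularLocation>
        <location>Cell membrane</location>
        <topology>Peripheral membrane protein</topology>
      </subcellularLocation>
    </comment>
  </entry>
</uniprot>')

# Unprefixed name: no match, because every element is in the default namespace
unprefixed_n <- length(xml_find_all(toy, "//subcellularLocation"))

# Option 1: use the prefix that xml_ns() assigns to the default namespace
ns <- xml_ns(toy)
prefixed <- xml_text(
  xml_find_all(toy, "//d1:subcellularLocation/d1:location", ns)
)

# Option 2: match on the local name only, ignoring the namespace
agnostic <- xml_text(xml_find_all(
  toy,
  "//*[local-name() = 'subcellularLocation']/*[local-name() = 'location']"
))

# Option 3: strip the default namespace (this modifies `toy` in place),
# after which the plain element names work
xml_ns_strip(toy)
stripped <- xml_text(xml_find_all(toy, "//subcellularLocation/location"))

unprefixed_n
prefixed
```

Option 1 leaves the document untouched, which is usually the safer choice; option 3 is convenient when the namespace carries no information you need.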

# The pipe must end a line, not start one; %>% comes from magrittr
xml_find_all(doc, "//d1:subcellularLocation") %>%
  xml_children() %>%
  xml_text()

## [1] "Nucleus"                                             
## [2] "Chromosome"                                          
## [3] "Cytoplasm"                                           
## [4] "Secreted"                                            
## [5] "Cell membrane"                                       
## [6] "Peripheral membrane protein"                         
## [7] "Extracellular side"                                  
## [8] "Endosome"                                            
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"
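
The flat vector above loses which values belong to the same subcellularLocation node, which is what the comment about Cell membrane being kept separate asks for. A possible sketch that keeps one entry per node is shown here. It again runs on a hypothetical toy document rather than the real UniProt file, and the names `by_node`, `location` and `details` are made up for illustration.

```r
library(xml2)

# Hypothetical stand-in for the UniProt XML (assumption, not the real file)
toy <- read_xml('
<uniprot xmlns="http://uniprot.org/uniprot">
  <subcellularLocation>
    <location>Nucleus</location>
  </subcellularLocation>
  <subcellularLocation>
    <location>Cell membrane</location>
    <topology>Peripheral membrane protein</topology>
  </subcellularLocation>
</uniprot>')

nodes <- xml_find_all(toy, "//d1:subcellularLocation")

# One character vector per <subcellularLocation> node
by_node <- lapply(nodes, function(node) xml_text(xml_children(node)))

# One row per node: first child as the location, remaining children joined
result <- data.frame(
  location = vapply(by_node, function(v) v[1], character(1)),
  details  = vapply(by_node, function(v) paste(v[-1], collapse = "; "),
                    character(1)),
  stringsAsFactors = FALSE
)
result
```

On the real document the same pattern should apply with `doc` in place of `toy`, assuming its subcellularLocation nodes have the location child first.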