
Scraping hierarchical data from XML


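I am trying to scrape the list of department stores by continent/country from the page below. I run the following code to get the continents first. As the output shows, the XML hierarchy is such that the countries belonging to each continent are not child nodes of that continent.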

> url<-"http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc = htmlTreeParse(url, useInternalNodes = T)
> nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']")
> # For Africa
> xmlChildren(nodeNames[[1]])
$a
<a href="/wiki/Africa" title="Africa">Africa</a> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"        
> xmlSize(nodeNames[[1]])
[1] 1
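I know I can handle the countries with a separate getNodeSet command, but I just want to make sure I am not missing something. Is there a smarter way to get all the data for each continent, and then for each country, in one go?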

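With XPath, several paths can be combined with the | separator. I use this to get the countries and the shops in the same list. Then I get a second list containing only the countries, and use it to split the first list.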

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

## Combined XPath: country headings (h3) and shop items (li), in document order
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li|
                                   //*[@id="mw-content-text"]/h3', xmlValue)
cont.shops <- do.call(rbind, cont.shops)                ## from list to one-column matrix


head(cont.shops)                  ## first element is country followed by shops
     [,1]                   
[1,] "[edit] Â Tunisia"     
[2,] "Magasin Général"
[3,] "Mercure Market"       
[4,] "Promogro"             
[5,] "Geant"                
[6,] "Carrefour"            
## Get all the countries in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
contries <- do.call(rbind, contries)                    ## from list to one-column matrix

head(contries)
     [,1]                   
[1,] "[edit] Â Tunisia"     
[2,] "[edit] Â Morocco"     
[3,] "[edit] Â Ghana"       
[4,] "[edit] Â Kenya"       
[5,] "[edit] Â Nigeria"     
[6,] "[edit] Â South Africa"
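
To see the combined path in isolation, here is a minimal sketch on a small inline page. The HTML and the `content` id are made up for illustration and are much simpler than the real Wikipedia markup. It also uses `xmlName` on the same node set, which is one possible way to tell headings from list items without a second country query.

```r
library(XML)

# Toy page (hypothetical content): headings and lists are siblings,
# so the shops are not child nodes of their country heading.
html <- '<html><body><div id="content">
<h3>Tunisia</h3>
<ul><li>Monoprix</li><li>Carrefour</li></ul>
<h3>Morocco</h3>
<ul><li>Alpha 55</li></ul>
</div></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Two paths joined with "|" form a single node set
path <- '//*[@id="content"]/h3 | //*[@id="content"]/ul/li'

items <- xpathSApply(doc, path, xmlValue)  # text of every matched node
kinds <- xpathSApply(doc, path, xmlName)   # element name of every matched node

print(data.frame(kind = kinds, text = items))
```

Because the union is returned as one node set in document order, each heading is immediately followed by the items that belong to it.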

Given the structure of the document, it might be easier to parse it with SAX than with a DOM tree.
This is very helpful. Thank you very much.
dd <- which(cont.shops %in% contries)                   ## positions of the country headings
freq <- c(diff(dd),length(cont.shops)-tail(dd,1)+1)     ## entries per country, heading included
contries.f <- rep(contries,freq)                        ## one country label per entry


ll <- split(cont.shops,contries.f)
> ll[[contries[1]]]
[1] "[edit]  Tunisia"      "Magasin Général" "Mercure Market"        "Promogro"              "Geant"                
[6] "Carrefour"             "Monoprix"             
> ll[[contries[2]]]
[1] "[edit] Â Morocco"                                                         
[2] "Alpha 55, one 6-story store in Casablanca"                                
[3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"
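
The same grouping can also be written with `cumsum()` instead of `diff()` and `rep()`. This is only a sketch on a made-up vector, not the code from the answer above. It assumes the first element is a heading and that no shop name is identical to a country name, the same assumption the `%in%` test above relies on.

```r
# Toy vector (hypothetical data): headings and items interleaved in document order
items     <- c("Tunisia", "Monoprix", "Carrefour", "Morocco", "Alpha 55")
countries <- c("Tunisia", "Morocco")

is_heading <- items %in% countries

# cumsum() goes up by one at each heading, so every element carries
# the running number of the heading it follows
group    <- cumsum(is_heading)
headings <- items[is_heading]
label    <- factor(headings[group], levels = unique(headings))

# Drop the heading rows themselves and split the remaining items by country
shops <- split(items[!is_heading], label[!is_heading])
print(shops)
```

Unlike the `split` above, this version leaves the heading out of each group, so every list element holds only shop names.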