XPath 1.0 expression returns NULL

html r xpath html-parsing rvest

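On this website, the following part of the HTML contains what I want to extract: the four cities where the firm has offices (Knoxville, Memphis, Nashville, and Sevierville).

Knoxville

I tried several variants of these XPath searches: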

require(XML)
require(httr)
doc <- content(GET('http://www.lewisthomason.com/locations/'))

xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)

The site checks the user agent. If you supply a suitable user agent, it sends you the correct content:
require(XML)
require(RCurl)
myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
doc <- getURL('http://www.lewisthomason.com/locations/', useragent = myAgent)
doc <- htmlParse(doc)


> xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
[4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"                                       
[5] ""                                                                                                                             
> xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389" 
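
Since the question used httr rather than RCurl, the same idea can be sketched with `GET()` and `user_agent()`. This is only an illustration: `fetch_html` is a hypothetical helper name, not something from the thread, and whether the site still filters on the user agent is an assumption.

```r
library(httr)
library(XML)

# Hypothetical helper (not from the thread): fetch a page while sending an
# explicit User-Agent header, then parse the body with the XML package.
fetch_html <- function(url, agent) {
  resp <- GET(url, user_agent(agent))
  # Raise an R error on 4xx/5xx responses instead of parsing an error page
  stop_for_status(resp)
  htmlParse(content(resp, as = "text", encoding = "UTF-8"), asText = TRUE)
}

# Guarded so that sourcing this file non-interactively does not hit the network
if (interactive()) {
  myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
  doc <- fetch_html("http://www.lewisthomason.com/locations/", myAgent)
  print(xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE))
}
```

Calling `stop_for_status()` before parsing makes a rejected request fail loudly, instead of surfacing later as an XPath query that matches nothing.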

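Without the user agent, the server answers with a 403 Forbidden page:

> getURL('http://www.lewisthomason.com/locations/')
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don't have permission to access /locations/\non this server.</p>\n</body></html>\n"
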
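
The 403 page is a plausible reason the original XPath queries came back empty: the expressions were fine, but the document being searched was the error page. The sketch below parses the 403 body quoted in this thread (no network access needed) and runs the question's XPath against it; the variable names are illustrative only.

```r
library(XML)

# The 403 response body quoted in the thread, rebuilt as a string
forbidden <- paste0(
  "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n",
  "<html><head>\n<title>403 Forbidden</title>\n</head><body>\n",
  "<h1>Forbidden</h1>\n",
  "<p>You don't have permission to access /locations/\non this server.</p>\n",
  "</body></html>\n"
)

doc <- htmlParse(forbidden, asText = TRUE)

# The question's query: the error page has no div with id 'the_content',
# so nothing matches
hits <- xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
n_hits <- length(hits)

# The title shows what was actually downloaded
page_title <- xpathSApply(doc, "//title", xmlValue)

print(n_hits)
print(page_title)
```

Checking the title (or the HTTP status) of whatever was downloaded is a quick way to tell a selector problem from a fetch problem.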
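CSS selectors to the rescue (XPath would also work):

library(rvest) # for scraping
library(httr)  # only for user_agent()

# Note: rvest >= 1.0.0 renames html_session() to session() and
# html_nodes() to html_elements(); the calls below are as originally posted.
pg <- html_session("http://www.lewisthomason.com/locations/",
                   user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))

# get names
pg %>% html_nodes("h3") %>% html_text()

## [1] "KNOXVILLE"   "MEMPHIS"     "NASHVILLE"   "SEVIERVILLE"

# get locations
pg %>% html_nodes("h3~p") %>% html_text() %>% .[1:4]

## [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
## [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
## [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
## [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"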
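
The answer mentions that XPath would work as well. As a rough sketch of that alternative, the CSS selector `h3 ~ p` can be written with the XPath 1.0 `following-sibling` axis. The markup below is a hypothetical stand-in modelled on the selectors used in this thread; the real page's structure may differ.

```r
library(XML)

# Hypothetical markup (an assumption, not copied from the site)
html <- '
<div id="the_content">
  <div class="one_fourth"><h3>KNOXVILLE</h3><p>620 Market Street</p></div>
  <div class="one_fourth"><h3>MEMPHIS</h3><p>40 S Main St #2900</p></div>
</div>'

doc <- htmlParse(html, asText = TRUE)

# CSS "h3" corresponds to XPath "//h3"
cities <- xpathSApply(doc, "//h3", xmlValue)

# CSS "h3 ~ p" (p elements that follow an h3 under the same parent)
# corresponds to the following-sibling axis
addresses <- xpathSApply(doc, "//h3/following-sibling::p", xmlValue, trim = TRUE)

# Pair each city with its address
offices <- data.frame(city = cities, address = addresses, stringsAsFactors = FALSE)
print(offices)
```

Pairing the two vectors by position only works when every heading has exactly one following paragraph, which holds for this toy markup but should be checked against the real page.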
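A wrapper to a wrapper to a wrapper ;)

Indeed :-) Though it does make it easier for people to get at data, especially with the SelectorGadget bookmarklet that Hadley included in the vignette. It also fits very well with the whole new "piping" fashion.

By the way, rvest imports %>% from magrittr, so you don't need dplyr.

@hadley, thanks. I use those three library calls so often that I now just type them by rote :-)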