
XPath 1.0 expression returns NULL

Tags: html, r, xpath, html-parsing, rvest

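On this website, this part of the HTML contains what I want to extract: the four cities where the firm has offices (Knoxville, Memphis, Nashville, and Sevierville).

Knoxville

I have tried several variations of these XPath searches, but each one returns NULL: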

require(XML)
require(httr)
doc <- content(GET('http://www.lewisthomason.com/locations/'))

xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
The site is checking the user agent. If you give it a suitable user agent, it sends you the correct content:

require(XML)
require(RCurl)
myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
doc <- getURL('http://www.lewisthomason.com/locations/', useragent = myAgent)
doc <- htmlParse(doc)


> xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"                  
[4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"                                       
[5] ""                                                                                                                             
> xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"                                              
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389" 
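
As a minimal offline sketch of why the original queries matched nothing, the same XPath can be run against a refused-request body and against a small hand-written page. The `good_html` string below is a hypothetical stand-in, not the site's real markup, and `forbidden_html` only approximates the error page the server sends.

```r
library(XML)

# Approximation of the body returned when the request is refused:
# it contains no <div> elements, so the query has nothing to match.
forbidden_html <- paste0(
  "<html><head><title>403 Forbidden</title></head>",
  "<body><h1>Forbidden</h1><p>You don't have permission.</p></body></html>"
)

# Hypothetical stand-in for the real page; this markup is an assumption.
good_html <- paste0(
  "<html><body><div id='the_content'><div class='one_fourth'>",
  "<h3>KNOXVILLE</h3><p>Knoxville, TN 37901</p>",
  "</div></div></body></html>"
)

query <- "//div[@id = 'the_content']/div//p"

forbidden_doc <- htmlParse(forbidden_html, asText = TRUE)
good_doc <- htmlParse(good_html, asText = TRUE)

forbidden_hits <- xpathSApply(forbidden_doc, query, xmlValue, trim = TRUE)
good_hits <- xpathSApply(good_doc, query, xmlValue, trim = TRUE)

length(forbidden_hits)  # zero matches in the error page
good_hits               # the paragraph text from the stand-in page
```

An empty result like this is a hint to inspect what the server actually returned before adjusting the XPath itself.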
Without the user agent, the server refuses the request:

> getURL('http://www.lewisthomason.com/locations/')
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don't have permission to access /locations/\non this server.</p>\n</body></html>\n"

CSS selectors to the rescue (XPath would also work):

library(rvest) # for scraping
library(httr)  # only for user_agent()

# rvest >= 1.0.0 renames html_session() to session()
pg <- html_session("http://www.lewisthomason.com/locations/",
                   user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))

# get names
pg %>% html_nodes("h3") %>% html_text()

## [1] "KNOXVILLE"   "MEMPHIS"     "NASHVILLE"   "SEVIERVILLE"

# get locations
pg %>% html_nodes("h3~p") %>% html_text() %>% .[1:4]

## [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
## [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
## [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
## [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"

Comments:

A wrapper to a wrapper to a wrapper ;)

Indeed :-) It does make it easier for people to get at the data, though, especially with the SelectorGadget bookmarklet that Hadley included in the vignette. It also fits nicely with the whole new "piping" idiom.

By the way, rvest imports %>% from magrittr, so you don't need dplyr.

@hadley, thanks. I use those three library calls so often that I now just type them by rote :-)
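
Since the second answer notes that XPath would also work, here is a rough sketch of an XPath counterpart to the `h3~p` CSS selector, using the `following-sibling` axis. It runs on a hypothetical inline page; the markup is an assumption made for illustration and is not copied from the site.

```r
library(XML)

# Hypothetical markup; the real page's structure is an assumption.
html <- paste0(
  "<html><body>",
  "<div><h3>KNOXVILLE</h3><p>Knoxville, TN 37901</p></div>",
  "<div><h3>MEMPHIS</h3><p>Memphis, TN 38103</p></div>",
  "<div><p>No heading before this paragraph</p></div>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Office names: every <h3> element.
city_names <- xpathSApply(doc, "//h3", xmlValue, trim = TRUE)

# XPath counterpart of CSS "h3 ~ p": <p> elements that come after
# an <h3> and share its parent.
addresses <- xpathSApply(doc, "//h3/following-sibling::p", xmlValue, trim = TRUE)

city_names
addresses
```

The paragraph in the third `<div>` is skipped because no `<h3>` precedes it under the same parent, which is the same filtering the CSS sibling combinator performs.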