HTML XPath 1.0 expression returns NULL
From this website, this part of the HTML has the content I want to extract: the four cities where the firm has offices (Knoxville, Memphis, Nashville, and Sevierville).
Knoxville
I tried several variations of these XPath searches:
require(XML)
require(httr)
doc <- content(GET('http://www.lewisthomason.com/locations/'))
xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
The website is checking the user agent. If you supply an appropriate user agent, it will send you the correct content:
require(XML)
require(RCurl)
myAgent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
doc <- getURL('http://www.lewisthomason.com/locations/', useragent = myAgent)
doc <- htmlParse(doc)
> xpathSApply(doc, "//div[@id = 'the_content']/div//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
[4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"
[5] ""
> xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
[1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
[2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
[3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
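The question asks only for the city names, while the XPath queries above return whole address blocks. As a minimal sketch, assuming every block contains a line of the form "City, TN zip" (true of the output shown, not verified beyond it), the city can be pulled out with a base R regular expression. The `addresses` vector below copies two of the strings from the output above.

```r
# Two address blocks copied from the output above
addresses <- c(
  "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529",
  "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
)

# Match the text on a single line that is directly followed by ", TN <5-digit zip>".
# Assumption: each block has exactly one such line.
hits <- regexpr("[^\n]+(?=, TN [0-9]{5})", addresses, perl = TRUE)
cities <- regmatches(addresses, hits)
cities
```

The lookahead keeps the state and zip code out of the match, so no further trimming is needed.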
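Without a user agent, the server refuses the request:
> getURL('http://www.lewisthomason.com/locations/')
[1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>403 Forbidden</title>\n</head><body>\n<h1>Forbidden</h1>\n<p>You don't have permission to access /locations/\non this server.</p>\n</body></html>\n"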
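The question used httr rather than RCurl, and the same fix can be sketched there with `user_agent()`. The helper name `fetch_parsed` is hypothetical and introduced only for illustration. The sketch assumes the site still responds as it did when the answer was written.

```r
library(httr)
library(XML)

# Same user-agent string as in the RCurl answer
my_agent <- "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"

# Hypothetical helper: request a page with an explicit user agent and parse it
fetch_parsed <- function(url, agent = my_agent) {
  resp <- GET(url, user_agent(agent))
  stop_for_status(resp)  # turns a 4xx/5xx response (such as the 403 above) into an R error
  htmlParse(content(resp, as = "text", encoding = "UTF-8"), asText = TRUE)
}

# Usage (requires network access):
# doc <- fetch_parsed("http://www.lewisthomason.com/locations/")
# xpathSApply(doc, "//div[@class = 'one_fourth']//p", xmlValue, trim = TRUE)
```

Calling `stop_for_status()` makes a refused request fail loudly instead of passing an error page on to the XPath queries.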
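CSS selectors to the rescue (XPath would work as well):
library(rvest) # for scraping
library(httr) # only for user_agent()
pg <- html_session("http://www.lewisthomason.com/locations/",
                   user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))
# get names
pg %>% html_nodes("h3") %>% html_text()
## [1] "KNOXVILLE" "MEMPHIS" "NASHVILLE" "SEVIERVILLE"
# get locations
pg %>% html_nodes("h3~p") %>% html_text() %>% .[1:4]
## [1] "One Centre Square, Fifth Floor\n620 Market Street\nPO Box 2425\nKnoxville, TN 37901\nPhone (865) 546-4646\nFax (865) 523-6529"
## [2] "40 S Main St #2900\nMemphis, TN 38103\nPhone (901) 525-8721\nFax (901) 525-6722"
## [3] "424 Church Street, Suite 2500\nPO Box 198615\nNashville, TN 37219\nPhone (615) 259-1366\nFax (615) 259-1389"
## [4] "248 Bruce St, Suite 2\nSevierville, TN 37862\nPhone (865) 429-1999\nFax (865) 428-1612"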
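Since the answer notes that XPath would work as well, here is a rough sketch of the XPath counterparts of the two CSS selectors. It runs on a small inline fragment, not the live page. The fragment's structure (an `h3` followed by a `p` inside each `div.one_fourth`) is an assumption inferred from the queries in this thread, not a copy of the real markup.

```r
library(rvest)  # also provides read_html() and %>%

# Hypothetical stand-in for the real page; structure is assumed
html <- '
<div id="the_content">
  <div class="one_fourth"><h3>KNOXVILLE</h3><p>Knoxville, TN 37901</p></div>
  <div class="one_fourth"><h3>MEMPHIS</h3><p>Memphis, TN 38103</p></div>
</div>'

doc <- read_html(html)

# XPath counterpart of the CSS selector "h3"
city_names <- doc %>% html_nodes(xpath = "//h3") %>% html_text()

# XPath counterpart of the CSS selector "h3~p":
# every <p> that is a later sibling of an <h3>
addresses <- doc %>%
  html_nodes(xpath = "//h3/following-sibling::p") %>%
  html_text()
```

The `following-sibling` axis mirrors the CSS general sibling combinator `~`, which is why the two forms select the same paragraphs here.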
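A wrapper to a wrapper to a wrapper ;)
Indeed :-) It does make it easier for people to get at the data, though, especially with Hadley including the SelectorGadget bookmarklet in the vignette. It also fits nicely with the whole new "piping" idiom.
BTW, rvest imports %>% from magrittr, so you don't need dplyr.
@hadley, thanks. I use those three library calls so often that I now just type them by rote :-)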