Html 使用R.asp创建.asp站点
我在刮Html 使用R.asp创建.asp站点,html,xml,r,web-scraping,Html,Xml,R,Web Scraping,我在刮http://www.progarchives.com/album.asp?id=并收到一条警告消息: 警告消息: XML内容似乎不是XML: 刮板分别适用于每个页面,但不适用于URLb1=2:b2=1000 library(RCurl) library(XML) getUrls <- function(b1,b2){ root="http://www.progarchives.com/album.asp?id=" urls <- NULL f
http://www.progarchives.com/album.asp?id=
并收到一条警告消息:
警告消息:XML内容似乎不是XML:
刮板分别适用于每个页面,但不适用于URL
b1=2:b2=1000
library(RCurl)
library(XML)
getUrls <- function(b1,b2){
root="http://www.progarchives.com/album.asp?id="
urls <- NULL
for (bandid in b1:b2){
urls <- c(urls,(paste(root,bandid,sep="")))
}
return(urls)
}
prog.arch.scraper <- function(url){
SOURCE <- getUrls(b1=2,b2=1000)
PARSED <- htmlParse(SOURCE)
album <- xpathSApply(PARSED,"//h1[1]",xmlValue)
date <- xpathSApply(PARSED,"//strong[1]",xmlValue)
band <- xpathSApply(PARSED,"//h2[1]",xmlValue)
return(c(band,album,date))
}
prog.arch.scraper(urls)
库(RCurl)
库(XML)
getURL这里有另一种使用rvest
和dplyr
的方法:
library(rvest)
library(dplyr)
library(pbapply)
base_url <- "http://www.progarchives.com/album.asp?id=%s"
get_album_info <- function(id) {
pg <- html(sprintf(base_url, id))
data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
stringsAsFactors=FALSE)
}
albums <- bind_rows(pblapply(2:10, get_album_info))
head(albums)
## Source: local data frame [6 x 3]
##
## album date band
## 1 FOXTROT Studio Album, released in 1972 Genesis
## 2 NURSERY CRYME Studio Album, released in 1971 Genesis
## 3 GENESIS LIVE Live, released in 1973 Genesis
## 4 A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
## 6 GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
您不会得到错误页面的任何条目(在本例中,它只返回id 9、10和30的条目)。您可以将每个路径的节点集中的第一个节点子集,而不是xpathApply()
,并调用xmlValue()
。这是我想到的
library(XML)
library(RCurl)
## define the urls and xpath queries
urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
path <- c(album = "//h1", date = "//strong", band = "//h2")
## define a re-usable curl handle for the c-level nodes
curl <- getCurlHandle()
## allocate the result list
out <- vector("list", length(urls))
## do the work
for(u in urls) {
content <- getURL(u, curl = curl)
doc <- htmlParse(content, useInternalNodes = TRUE)
out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
free(doc)
}
## structure the result
data.table::rbindlist(out)
# album date band
# 1: FOXTROT Studio Album, released in 1972 Genesis
# 2: NURSERY CRYME Studio Album, released in 1971 Genesis
# 3: GENESIS LIVE Live, released in 1973 Genesis
# 4: A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
# 5: FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
# 6: GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
# 7: GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
# 8: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 9: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
谢谢它起作用了,但我收到一条错误消息,说没有函数“bind_rows”。我重新安装了所有的软件包,但仍然没有成功。rbindlist
成功了。一段时间以来,我一直想进入rvest
,所以你的代码让我对它进行了更详细的研究。谢谢@hrbrmstr。还有一个问题是,sprintf
在html函数中实际做了什么?我对刮取大约48000页感兴趣,但我注意到刮取器在遇到断页时会停止,即内部错误
。处理这些问题的一种方法是,在每个页面上检查一个注释,指出哪些页面被破坏了,并将好的页面连接到相册
对象中,但这很耗时。对于如何处理破损的页面,你有什么建议吗?干杯。解析中出错。响应(r,解析器,encoding=encoding):服务器错误:(500)当序列包含断页时,内部服务器错误
。问题是,没有办法知道哪些是坏的。到目前为止,我已经确认了23、28、29、34、44、86、134、165、188、252、332、350、351、377、378、531、688、758、816、818、876、886:889、937、960、961、976、1002、1054、1084、1103、1116等。可能还有数百个破碎的。签出http://www.progarchives.com/album.asp?id=2347
例如。更新以处理应答中的服务器错误。还添加了rbindlist
@richardscriven。谢谢这很好,除了我遇到了与上面相同的断开链接的问题。
library(XML)
library(RCurl)
## define the urls and xpath queries
urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
path <- c(album = "//h1", date = "//strong", band = "//h2")
## define a re-usable curl handle for the c-level nodes
curl <- getCurlHandle()
## allocate the result list
out <- vector("list", length(urls))
## do the work
for(u in urls) {
content <- getURL(u, curl = curl)
doc <- htmlParse(content, useInternalNodes = TRUE)
out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
free(doc)
}
## structure the result
data.table::rbindlist(out)
# album date band
# 1: FOXTROT Studio Album, released in 1972 Genesis
# 2: NURSERY CRYME Studio Album, released in 1971 Genesis
# 3: GENESIS LIVE Live, released in 1973 Genesis
# 4: A TRICK OF THE TAIL Studio Album, released in 1976 Genesis
# 5: FROM GENESIS TO REVELATION Studio Album, released in 1969 Genesis
# 6: GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
# 7: GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
# 8: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 9: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
getAlbums <- function(url, id = numeric(), xPath = list()) {
urls <- sprintf("%s?id=%d", url, id)
curl <- getCurlHandle()
out <- vector("list", length(urls))
for(u in urls) {
out[[u]] <- if(url.exists(u)) {
content <- getURL(u, curl = curl)
doc <- htmlParse(content, useInternalNodes = TRUE)
lapply(path, function(x) xmlValue(doc[x][[1]]))
} else {
warning(sprintf("returning 'NA' for urls[%d] ", id[urls == u]))
structure(as.list(path[NA]), names = names(path))
}
if(exists("doc")) free(doc)
}
data.table::rbindlist(out)
}
url <- "http://www.progarchives.com/album.asp"
id <- c(9:10, 23, 28, 29, 30)
path <- c(album = "//h1", date = "//strong", band = "//h2")
getAlbums(url, id, path)
# album date band
# 1: THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 2: THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# 6: AD INFINITUM Studio Album, released in 1998 Ad Infinitum
#
# Warning messages:
# 1: In albums(url, id, path) : returning 'NA' for urls[23]
# 2: In albums(url, id, path) : returning 'NA' for urls[28]
# 3: In albums(url, id, path) : returning 'NA' for urls[29]