Change a for loop into a function that scrapes a website
I am trying to scrape a website using the following tools:
industryurl <- "https://finance.yahoo.com/industries"

library(rvest)
library(plyr)

## Read the industries page and pull out every table on it
read <- read_html(industryurl) %>%
  html_table()
industries <- ldply(read, data.frame)
industries <- industries[-1, ]

## Re-read the page, collect the href of every <a> node,
## and keep only the per-industry links
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]

## The hrefs are relative, so prepend the site root
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
## Fetch each industry page and store its tables, keyed by link
store <- list()
tbl <- list()
for (i in links) {
  store[[i]] <- read_html(i)
  tbl[[i]] <- html_table(store[[i]])
}
#################################################################################################
Here is the function I came up with:
## First argument is the link you need
## The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link: ", define_link))
  return(html_table(first))
}
## I added the following tryCatch wrapper, so a failing
## link returns NA instead of stopping the whole run
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
## You can now retrieve the data using the links vector in two ways
## Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
## OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
This works great, and it is very informative too, thank you! One question: lapply(1:5, ...) works, but I changed it to lapply(1:length(links), ...) and it stops at link 5, even though there are 214 links to extract. I get this error: Error in open.connection(x, "rb"): HTTP error 503.
Let me take a look. A 503 error roughly means something went wrong on the server side and it could not fulfil the request. If you remove the offending link, the code should work. As for the source of the 503: it seems the link https://finance.yahoo.com/industry/Manufactured_Housing is the one causing the crash. The same happens with the Yahoo Finance link https://finance.yahoo.com/industries.
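The point of the tryCatch wrapper above is exactly this case: an HTTP 503 on one link should yield NA instead of aborting the whole lapply run, and the failed entries can be dropped afterwards. Here is a minimal self-contained sketch of that pattern; no network is used, the failing URL raises the 503 via stop() as a stand-in for read_html() erroring, and the non-Manufactured_Housing demo link names are made up for illustration.

```r
## Stand-in for link_try_catch: the Manufactured_Housing "link" fails
## with the same 503 message, simulating read_html() raising an error.
fetch <- function(link) {
  tryCatch({
    if (grepl("Manufactured_Housing", link))
      stop("HTTP error 503.")
    paste("tables for", link)  # stand-in for html_table(read_html(link))
  }, error = function(e) NA)
}

## Hypothetical industry slugs, except the one reported in the thread
links_demo <- c("https://finance.yahoo.com/industry/Aerospace_Defense",
                "https://finance.yahoo.com/industry/Manufactured_Housing",
                "https://finance.yahoo.com/industry/Agricultural_Chemicals")

p  <- lapply(links_demo, fetch)   # lapply runs on past the failing link
ok <- Filter(Negate(is.na), p)    # drop the NA results afterwards
length(ok)                        # 2 of the 3 demo links succeed
```

Filtering with Filter(Negate(is.na), ...) keeps the successful tables only, so the full 214-link run can finish even when a handful of pages return 503s.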