
Web scraping with rvest: getting HTTP Error 502


I have an R script that uses rvest to pull some data from accuweather. The accuweather URLs contain IDs that uniquely correspond to cities. I'm trying to pull the IDs in a given range along with the associated city names. rvest works perfectly for a single ID, but when I iterate over a for loop it eventually returns this error: 'Error in open.connection(x, "rb") : HTTP error 502.'

I suspect this error is the result of the website blocking me. How do I get around this? I want to scrape from a fairly large range (10,000 IDs), and the loop keeps giving me this error after roughly 500 iterations. I also tried closeAllConnections() and Sys.sleep(), but without success. Any help with this problem would be really appreciated.

library(rvest)
library(httr)

# create matrix to store IDs and Cities
# each ID corresponds to a single city
id_mat <- matrix(0, ncol = 2, nrow = 10001)

# initialize index for matrix row
j = 1

for (i in 300000:310000){
  z <- as.character(i)
  # pull city name from website
  accu <- read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = ""))
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  # store values and advance the matrix row
  id_mat[j, 1] = i
  id_mat[j, 2] = citystate
  j = j + 1
  # close connections after 200 pulls, wait 5 mins and loop again
  if (i %% 200 == 0) {
    closeAllConnections()
    Sys.sleep(300)
    next
  } else {
    # sleep for 1 or 2 seconds every loop
    Sys.sleep(sample(2, 1))
  }
}
EDIT: Solved. I found a way around it here: . I used tryCatch() with error = function(e) e as an argument, which suppressed the error message and allowed the loop to keep running without interruption. Hopefully this helps someone else stuck on a similar problem.
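A minimal sketch of how that tryCatch() wrapper might look inside the original loop (the inherits() check that skips a failed ID is an added assumption; with error = function(e) e the call returns the error object rather than a parsed page):

library(rvest)

# same storage matrix and row index as before
id_mat <- matrix(0, ncol = 2, nrow = 10001)
j = 1

for (i in 300000:310000){
  z <- as.character(i)
  # wrap the request so an HTTP error comes back as an object
  # instead of stopping the loop
  accu <- tryCatch(
    read_html(paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")),
    error = function(e) e
  )
  # assumption: skip this ID if the request failed
  if (inherits(accu, "error")) next
  citystate <- accu %>% html_nodes('h1') %>% html_text()
  id_mat[j, 1] = i
  id_mat[j, 2] = citystate
  j = j + 1
  # sleep for 1 or 2 seconds every loop
  Sys.sleep(sample(2, 1))
}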

The problem seems to come from scientific notation.
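For illustration (this demo is not from the original answer): a large double coerced to character can come out in scientific notation, which corrupts the URL, while format() with scientific = FALSE keeps the digits:

# a large double may be rendered in scientific notation
as.character(3e+05)                 # "3e+05" -- breaks the URL
format(3e+05, scientific = FALSE)   # "300000"
# options(scipen = 999) raises the penalty R applies before
# switching formatted output to scientific notation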

I changed your code slightly, and now it seems to work:

library(rvest)
library(httr)

id_mat <- matrix(0, ncol = 2, nrow = 10001)

# try to download the page to scrapedpage.html; return 1 on success
# and 0 on any error or warning (such as a 502)
readUrl <- function(url) {
  out <- tryCatch(
    {
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      return(1)
    },
    error = function(cond) {
      return(0)
    },
    warning = function(cond) {
      return(0)
    }
  )
  return(out)
}

j = 1

# keep IDs from being formatted in scientific notation
options(scipen = 999)

for (i in 300000:310000){
  z <- as.character(i)
  # pull city name from website
  url <- paste("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", z, sep = "")
  if (readUrl(url) == 1) {
    # readUrl() has already saved the page, so parse the local copy
    accu <- read_html("scrapedpage.html")
    citystate <- accu %>% html_nodes('h1') %>% html_text()
    # store values and advance the matrix row
    id_mat[j, 1] = i
    id_mat[j, 2] = citystate
    j = j + 1
    # close connections after 200 pulls, wait 5 mins and loop again
    if (i %% 200 == 0) {
      closeAllConnections()
      Sys.sleep(300)
    } else {
      # sleep for 1 or 2 seconds every loop
      Sys.sleep(sample(2, 1))
    }
  }
}
Okay, so I tried your code and it worked for about 1,300 IDs, then I received a similar error message: "In download.file(url, destfile = "scrapedpage.html", quiet = TRUE): cannot open URL '': HTTP status was '502 Bad Gateway'"

To avoid that kind of error you can refer to this answer; I edited the code above to include the readUrl() function from there.
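Not part of the original thread, but since httr is already loaded and the 502s look transient, one alternative sketch is to let httr::RETRY() re-request with exponential backoff; fetch_city() here is a hypothetical helper, not from the posted code:

library(httr)
library(rvest)

# hypothetical helper: retry transient failures such as a 502 with
# exponential backoff, returning NA instead of stopping the loop
fetch_city <- function(id) {
  url <- paste0("https://www.accuweather.com/en/us/new-york-ny/10007/june-weather/", id)
  resp <- RETRY("GET", url, times = 5, pause_base = 2, quiet = TRUE)
  if (status_code(resp) != 200) return(NA_character_)
  page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
  page %>% html_node("h1") %>% html_text()
}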