Discontinued URLs while web scraping with rvest (R)

I have built a function that takes a URL and returns the desired result after scraping the web page. The function is shown below:

library(httr) 
library(curl) 
library(rvest) 
library(dplyr)

sd_cat <- function(url){
  cat <- curl(url, handle = new_handle("useragent" = "myua")) %>%
  read_html() %>%
  html_nodes("#breadCrumbWrapper") %>%
  html_text()

x <- cat[1]

#y <- gsub(pattern = "\n", x=x, replacement = " ")

y <- gsub(pattern = "\t", x=x, replacement = " ")

y <- gsub("\\d|,|\t", x=y, replacement = "")

y <- gsub("^ *|(?<= ) | *$", "", y, perl=T)

z <- gsub("\n*{2,}","",y)

z <- gsub(" {2,}",">",z)

final <- substring(z,2)

final <- substring(final,1,nchar(final)-1)

final

#sample discontinued url: "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261"
#sample working url: "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133"
}
You can try using a for loop instead of sapply. You can then use tryCatch() without any problems:

url <- c("first_url", "second_url")
result <- vector("list", length(url))

# Any error raised by sd_cat() is replaced with the string "Error 404"
for (i in seq_along(url)) {
    result[[i]] <- tryCatch(sd_cat(url[i]), error = function(err) "Error 404")
}
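
The same pattern can be tried without network access. The sketch below is only an illustration: fake_scrape() is a hypothetical stand-in for sd_cat() that fails for any URL containing "missing", and the example.com URLs are made up. Instead of a fixed "Error 404" string, it stores NA for a failed URL and keeps the actual error message in a second vector.

```r
# Hypothetical stand-in for sd_cat(): fails for URLs containing "missing".
fake_scrape <- function(url) {
  if (grepl("missing", url, fixed = TRUE)) {
    stop("HTTP error 404.")
  }
  paste("category for", url)
}

# Made-up URLs, used only to drive the example.
urls <- c("http://example.com/ok", "http://example.com/missing")

# NA marks URLs that could not be scraped; errors keeps the message.
result <- rep(NA_character_, length(urls))
errors <- rep(NA_character_, length(urls))

for (i in seq_along(urls)) {
  result[i] <- tryCatch(
    fake_scrape(urls[i]),
    error = function(err) {
      # Record the message so that failures can be inspected later.
      errors[i] <<- conditionMessage(err)
      NA_character_
    }
  )
}

print(result)
print(errors)
```

Returning NA rather than a message string keeps result a plain character vector in which failures are easy to filter with is.na().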

A better solution is to use httr and deliberately handle the case where the response is not OK:

library(httr) 
library(rvest) 

sd_cat <- function(url){
  r <- GET(url, user_agent("myua"))
  if (status_code(r) >= 300)
    return(NA_character_)

  r %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    .[[1]] %>% 
    html_nodes("span") %>% 
    html_text()
}

sd_cat("http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261")
sd_cat("http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133")
(I also replaced the regular expressions with better use of rvest.)
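
To see how node selection can stand in for the regular expressions, here is a minimal offline sketch. The markup is invented for the example and only assumes that the breadcrumb lives in span elements inside #breadCrumbWrapper; the real page layout may differ. breadcrumb_path() is a hypothetical helper, not part of rvest. It also checks that the wrapper exists before indexing, since [[1]] on an empty node set would raise an error.

```r
library(rvest)

# Made-up markup standing in for a product page (an assumption).
html <- '<html><body>
  <div id="breadCrumbWrapper">
    <span> Home </span><span>Perfumes</span><span> Men </span>
  </div>
</body></html>'

# Hypothetical helper: returns the breadcrumb as "a>b>c",
# or NA if the page has no #breadCrumbWrapper node.
breadcrumb_path <- function(page) {
  wrapper <- html_nodes(page, "#breadCrumbWrapper")
  if (length(wrapper) == 0) {
    return(NA_character_)
  }
  # trim = TRUE strips the surrounding whitespace from each span.
  crumbs <- html_text(html_nodes(wrapper[[1]], "span"), trim = TRUE)
  paste(crumbs, collapse = ">")
}

breadcrumb_path(read_html(html))
breadcrumb_path(read_html("<html><body><p>gone</p></body></html>"))
```

With this sample markup the first call yields "Home>Perfumes>Men" and the second yields NA, so no gsub() cleanup of tabs, digits, or newlines is needed.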