How can I web-scrape in R without failing on empty/missing pages?
Tags: r, loops, pagination, data-science, rvest

I need to extract information about species and wrote the code below. However, I have a problem with some species whose pages are missing. How can I avoid this problem?
Q<-c("rvest","stringr","tidyverse","jsonlite")
lapply(Q,require,character.only=TRUE)
# This list was produced by pagination code that I have not included, to keep the example short
sp1<-as.matrix(c("https://www.gulfbase.org/species/Acanthilia-intermedia", "https://www.gulfbase.org/species/Achelous-floridanus", "https://www.gulfbase.org/species/Achelous-ordwayi", "https://www.gulfbase.org/species/Achelous-spinicarpus","https://www.gulfbase.org/species/Achelous-spinimanus",
"https://www.gulfbase.org/species/Agolambrus-agonus",
"https://www.gulfbase.org/species/Agononida-longipes",
"https://www.gulfbase.org/species/Amphithrax-aculeatus",
"https://www.gulfbase.org/species/Anasimus-latus"))
GiveMeData <- function(url){
  page <- read_html(url)
  # Min depth
  selmin <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.figures--joined > div:nth-child(1)"
  mintext <- html_text(html_node(page, selmin))
  # Max depth
  selmax <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.figures--joined > div:nth-child(2)"
  maxtext <- html_text(html_node(page, selmax))
  # Distribution
  seldist <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div:nth-child(2) > div:nth-child(2) > div"
  distext <- html_text(html_node(page, seldist))
  # Habitat
  selhab <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div:nth-child(3) > ul"
  habtext <- html_text(html_node(page, selhab))
  # Micro-habitat
  selhab2 <- "#block-beaker-content > article > div > main > section.node--full__main > div.node--full__figures > div.field > ul > li"
  habtext2 <- html_text(html_node(page, selhab2))
  # References
  selref <- "#block-beaker-content > article > div > main > section.node--full__related"
  reftext <- html_text(html_node(page, selref))
  # Strip the field labels, units, and whitespace from the raw text
  mintext <- gsub("\n \n Min Depth\n \n \n ", "", mintext)
  mintext <- gsub(" meters\n \n ", "", mintext)
  maxtext <- gsub("\n \n Max Depth\n \n \n ", "", maxtext)
  maxtext <- gsub(" meters\n \n", "", maxtext)
  habtext <- gsub("\n", ",", habtext)
  habtext <- gsub("\\s", "", habtext)
  reftext <- gsub("\n\n", ";", reftext)
  reftext <- gsub("\\s", "", reftext)
  # Return a two-row matrix: header row plus the scraped values
  rbind(Info = c("Min", "Max", "Distribution", "Habitat", "MicroHabitat", "References"),
        Data = c(mintext, maxtext, distext, habtext, habtext2, reftext))
}
# `pag` is the full paginated URL list (not shown); with the short list above,
# lapply(sp1, GiveMeData) works the same way
doit <- lapply(pag[1:10], GiveMeData)
I think there may be ways to improve the GiveMeData
function itself, but with the existing function we can use tryCatch
to skip the sites that return an error:
output <- lapply(c(sp1), function(x) tryCatch(GiveMeData(x), error = function(e){}))
output
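A note on that pattern (this follow-up is mine, not part of the original answer): `error = function(e){}` makes tryCatch return NULL for every failed URL, so the failed pages can be separated from the good ones afterwards. A minimal self-contained toy, with a fake scraper standing in for GiveMeData:

```r
# Toy stand-in for GiveMeData: it fails for one "URL",
# the way read_html() fails on a missing species page
toy_scrape <- function(u) if (u == "bad") stop("HTTP 404") else paste("data for", u)

urls   <- c("a", "bad", "b")
output <- lapply(urls, function(x) tryCatch(toy_scrape(x), error = function(e) NULL))

ok     <- Filter(Negate(is.null), output)           # successful scrapes only
failed <- urls[vapply(output, is.null, logical(1))] # URLs to retry or report
```

Returning NA instead (`error = function(e) NA`) also works and keeps the result list the same length as the URL vector, which makes it easier to line results back up with sp1.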