Document parsing error when scraping multiple pages with rvest

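I am trying to scrape links from multiple pages of a web forum, but I get an error message that I don't know how to fix.

I tried the following with rvest and purrr: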

pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
  paste0(1:18000) %>%
  paste0(c(".html"))

i<-1
pages.subset<-pages[1:(i+49)==(i+49)]
pages.subset<-as_data_frame(pages.subset)

scrape_links<-function(pages.subset){read_html(pages.subset) %>% html_node(".topictitle") %>% html_attr('href')}
links<-map_df(pages.subset, scrape_links)
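
A note on how purrr treats the input in the call above: a data frame is a list of columns, so mapping over one calls the function once per column with the whole column as its argument, not once per row. The sketch below illustrates this without any network access. The example.com URLs and the count_inputs helper are hypothetical, introduced only for illustration.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in URLs; no requests are made.
urls <- paste0("https://example.com/page", 1:3, ".html")
urls_df <- tibble(value = urls)

# Hypothetical helper: report how many strings arrive in a single call.
count_inputs <- function(x) length(x)

# Over a data frame: one call per column, each receiving all 3 strings.
per_column <- map_int(urls_df, count_inputs)

# Over a character vector: one call per element, each receiving 1 string.
per_element <- map_int(urls, count_inputs)

print(per_column)
print(per_element)
```

If this is what happens in the question's code, read_html() would receive the entire vector of URLs in one call instead of a single URL, which would be enough to make it fail.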

While I am not 100% sure what caused the error, it seems that passing the whole data.frame as a list to the map_df call messes things up. I reworked your code:

library(tidyverse)
library(rvest)

pages <- c("https://www.immigrationboards.com/eea-route-applications/page") %>%
  paste0(1:18000) %>%
  paste0(c(".html"))

scrape_links <- function(url) {
  out <- url %>%
    read_html() %>%
    html_node(".topictitle") %>%
    html_attr("href")
  return(out)
}

links <- tibble(page = pages[1:(50) == (50)]) %>%
  mutate(url = map_chr(page, scrape_links))

head(links)
# # A tibble: 6 x 2
#   page                                                                  url                                                                                                            
#   <chr>                                                                 <chr>                                                                                                          
# 1 https://www.immigrationboards.com/eea-route-applications/page50.html  https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 2 https://www.immigrationboards.com/eea-route-applications/page100.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 3 https://www.immigrationboards.com/eea-route-applications/page150.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 4 https://www.immigrationboards.com/eea-route-applications/page200.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 5 https://www.immigrationboards.com/eea-route-applications/page250.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
# 6 https://www.immigrationboards.com/eea-route-applications/page300.html https://www.immigrationboards.com/eea-route-applications/covid-19-home-office-immigration-guidance-t299291.html
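
Note that every row in the output above has the same url. html_node() returns only the first element that matches the selector, and on these pages the first .topictitle is presumably a pinned topic. If the goal is to collect every topic link on a page, html_nodes() may be closer to what is needed. Here is a minimal sketch on inline HTML. The markup and hrefs are made up for illustration and are not taken from the forum.

```r
library(rvest)

# Hypothetical inline HTML standing in for one forum page; no request is made.
page <- read_html('
<html><body>
  <a class="topictitle" href="/pinned-topic.html">Pinned</a>
  <a class="topictitle" href="/topic-a.html">Topic A</a>
  <a class="topictitle" href="/topic-b.html">Topic B</a>
</body></html>')

# html_node() keeps only the first match.
first_link <- page %>% html_node(".topictitle") %>% html_attr("href")

# html_nodes() keeps every match.
all_links <- page %>% html_nodes(".topictitle") %>% html_attr("href")

print(first_link)
print(all_links)
```

With html_nodes() each page yields a vector of links instead of a single string, so map_chr() would no longer fit. map() followed by unlist(), or a list-column, would be one way to collect the results.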