Is there a way in R to loop through multiple pages of a website?

I'm fairly new to web scraping. For practice, I'm trying to scrape book titles from a mock website with multiple pages (http://books.toscrape.com/catalogue/page-1.html) and then compute some metrics based on those titles. There are 20 books per page across 50 pages; I've managed to collect titles and compute metrics for the first 20 books, but I'd like to do the same for the full 1000 books on the site.

The current output looks like this:

 [1] "A Light in the Attic"                                                                          
 [2] "Tipping the Velvet"                                                                            
 [3] "Soumission"                                                                                    
 [4] "Sharp Objects"                                                                                 
 [5] "Sapiens: A Brief History of Humankind"                                                         
 [6] "The Requiem Red"                                                                               
 [7] "The Dirty Little Secrets of Getting Your Dream Job"                                            
 [8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"       
 [9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"                                                                               
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"                                                
[12] "Shakespeare's Sonnets"                                                                         
[13] "Set Me Free"                                                                                   
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"                                       
[15] "Rip it Up and Start Again"                                                                     
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"            
[17] "Olio"                                                                                          
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"                                         
[19] "Libertarianism for Beginners"                                                                  
[20] "It's Only the Himalayas"
I'd like this to be 1000 titles instead of 20, which would let me run the same metric calculations, but over 1000 books rather than 20.

Code:

library(rvest)

url <- 'http://books.toscrape.com/catalogue/page-1.html'

url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles

What is the best way to scrape every book on the site so that I get 1000 titles instead of 20? Thanks in advance.

Maybe something like this:

library(tidyverse)
library(rvest)
library(data.table)
# Vector with the URLs of all 50 catalogue pages to scrape
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:50, ".html")
# Scrape to list
L <- lapply( url, function(x) {
  print( paste0( "scraping: ", x, " ... " ) )
  data.table(titles = read_html(x) %>%
              html_nodes('h3 a') %>%
              html_attr('title') )
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
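The rbindlist() call combines the 50 per-page tables into a single data.table with one row per book. As a minimal sketch of the follow-up step the question mentions (computing metrics over all 1000 titles), assuming the combined table is stored in a variable, here hypothetically named books, with title length as a stand-in example metric:

# hypothetical names: `books` and `title_length` are illustrative, not from the answer above
books <- data.table::rbindlist(L, use.names = TRUE, fill = TRUE)
# example metric: character length of each title
books[, title_length := nchar(titles)]
# sanity check: 50 pages x 20 books = 1000 rows expected
nrow(books)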

Generate the 50 URLs, then iterate over them, e.g. using purrr::map:

library(rvest)

urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')

titles <- purrr::map(
  urls, 
  . %>% 
    read_html() %>%
    html_nodes('h3 a') %>%
    html_attr('title')
)
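purrr::map() returns a list with one character vector per page. To get the single 1000-element vector of titles that the question's output format implies, you can flatten the list; a minimal sketch (the unlist() step is an assumption about the desired shape, not part of the answer above):

# collapse the list of 50 per-page vectors into one character vector
titles <- unlist(titles)
length(titles)  # should be 1000: 50 pages x 20 books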