R scraping tables from multiple pages of a website


I am new to web scraping. Below is my code; I want to scrape the table from all the pages, or even just the first 5 pages would be enough.

Website =

I am not sure how to proceed so that the tables from all three pages end up in a single table. Please help, many thanks.

I have tried to run this code, but it does not produce a table:

require(dplyr)
require(rvest)

options(stringsAsFactors = FALSE)

url_base <- "https://finviz.com/screener.ashx?v=152&f=cap_midover&o=ticker&r="

tbl.clactions <- data.frame(
  "Ticker" = character(0),"Company" = character(0),
  "Sector" = character(0),"Industry" = character(0),
  "Country" = character(0),"Market.Cap" = character(0),
  "P/E" = character(0),"ROA" = character(0),
  "ROE" = character(0),"Price" = character(0),
  "Change" = character(0),"Volume" = character(0)
)

page <- c(0,21,41)

for (i in page) { 
  url <- paste0(url_base, i)
  tbl.page <- url %>%
    read_html() %>%
    html_nodes(xpath='//*[@id="screener-content"]/table/tbody/tr[4]/td/table') %>%
    html_table()
}
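
As written, the loop discards each page's result: tbl.page is overwritten on every iteration and nothing is ever appended to tbl.clactions. A minimal sketch of an accumulating version, assuming the XPath actually matches a table node on the page:

for (i in page) {
  url <- paste0(url_base, i)
  tbl.page <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="screener-content"]/table/tbody/tr[4]/td/table') %>%
    html_table(fill = TRUE)
  # html_table() returns a list of data frames; append the first match, if any
  if (length(tbl.page) > 0) {
    tbl.clactions <- bind_rows(tbl.clactions, tbl.page[[1]])
  }
}

Note that the column names scraped from the page may not line up with the empty tbl.clactions skeleton defined above, so some renaming may still be needed afterwards.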

This code does not seem to throw an error; here is one approach:

library(rvest)
library(dplyr)

# Generate all the URLs from which we need to extract the data
url_base <- paste0("https://finviz.com/screener.ashx?v=152&f=cap_midover&o=ticker&r=", c(0,21,41))

# Extract the table from each URL and bind the results into one table
purrr::map_df(url_base, ~ .x %>%
    read_html() %>%
    html_table(fill = TRUE) %>%
    .[[10]] %>%                          # the screener data is the 10th table on the page
    setNames(as.character(.[1, ])) %>%   # the first row holds the column names
    slice(-1))                           # drop that header row from the data

#   No. Ticker                                         Company           Sector
#1    1      A                      Agilent Technologies, Inc.       Healthcare
#2    2     AA                               Alcoa Corporation  Basic Materials
#3    3   AABA                                     Altaba Inc.        Financial
#4    4    AAL                    American Airlines Group Inc.         Services
#5    5    AAN                                   Aaron's, Inc.         Services
#6    6   AAON                                      AAON, Inc. Industrial Goods
#....
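
As an aside, the question asked for the first five pages; the r= offset in the URL appears to advance by 20 tickers per page (0, 21, 41 above), so the offsets can be generated instead of hard-coded. A small sketch under that assumption:

# first five pages, assuming an offset spacing of 20 rows per page
offsets <- c(0, seq(21, by = 20, length.out = 4))  # 0, 21, 41, 61, 81
url_base <- paste0("https://finviz.com/screener.ashx?v=152&f=cap_midover&o=ticker&r=", offsets)
# url_base can then be fed to the purrr::map_df() call above unchanged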

Note that the site's policy explicitly prohibits this kind of scraping.

Hello, this code works for me! Thank you so much for the help, I had been struggling with this for days! May I ask you to clarify a few details? What exactly do ~.x, .[[10]], as.character(.[1,]) and slice(-1) mean? Each of these four has its own specific purpose, right? This is a weak point in my coding; I would be grateful if you could suggest some books or websites where a beginner could learn all of this. Thanks for your help!

@Kasen In your for loop, "for (i in page)", .x here plays the same role as i. Each page contains multiple tables; by doing .[[10]] we select the tenth table, because that is the one we need. When the data is extracted, the column names are not read correctly as a header but as an ordinary row. So by doing setNames(as.character(.[1,])) we assign the first row as the column names, and since that first row holds the column names, we remove it from the actual data with slice(-1).

@Kasen I checked manually. Since all the pages have the same structure, it is the 10th table on every page.

Thank you for the explanation! My last question: how do I know that the table I want is the tenth one? And where should I add Sys.sleep(5) to pause between requests?

@Kasen To add some delay, you can do purrr::map_df(url_base, ~ {Sys.sleep(5); .x %>% read_html() %>% html_table(fill = TRUE) %>% .[[10]] %>% setNames(as.character(.[1,])) %>% slice(-1)}). As for the manual check, I inspected a single page with url_base[1] %>% read_html() %>% html_table(fill = TRUE) and then looked at where the table was in the result.
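
The manual check described in the last comment can also be scripted. A small sketch, using the same url_base vector as above, for locating which element of the html_table() list holds the screener data:

library(rvest)

# read one page and pull out every table it contains
tables <- url_base[1] %>%
  read_html() %>%
  html_table(fill = TRUE)

length(tables)        # how many tables the page holds
sapply(tables, nrow)  # row counts; the screener table stands out as the largest
head(tables[[10]])    # inspect the candidate table before committing to the index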