使用lappy和tryCatch只返回一条记录，而不是多条记录_R_Web Scraping_Rvest

使用lappy和tryCatch只返回一条记录，而不是多条记录

r web-scraping

使用lappy和tryCatch只返回一条记录，而不是多条记录,r,web-scraping,rvest,R,Web Scraping,Rvest,我对R还是新手，我正在尝试使用Lappy和tryCatch通过职业足球参考网站访问数据，但我在下面使用的代码只返回1个表我想得到所有存在的记录。一些URL将不存在，这就是我使用tryCatch跳过这些URL的原因。以下是我的尝试： library(rvest) library(stringr) #create a master dataframe to store all of the results complete <- data.frame() datesVector <

我对R还是新手，我正在尝试使用Lappy和tryCatch通过职业足球参考网站访问数据，但我在下面使用的代码只返回1个表

我想得到所有存在的记录。一些URL将不存在，这就是我使用tryCatch跳过这些URL的原因。以下是我的尝试：

library(rvest)
library(stringr)

#create a master dataframe to store all of the results
complete <- data.frame()

datesVector <- c("201909080", "201909050")
teamsVector <- c("chi", "crd", "car")

  for (i in 1:length(datesVector)) {
  for (j in 1:length(teamsVector)) {
    
    # create a url template 
    URL.base <- "http://www.pro-football-reference.com/boxscores/"
    URL.end <- ".htm"
    
    #create the dataframe with the dynamic values
    URL <- paste0(URL.base, datesVector[i], teamsVector[j], URL.end)
    
    tryCatch({
      
       dfList <- lapply(URL, function(i) {
        webpage <- read_html(i)
        draft_table <- html_nodes(webpage, 'table')
        draft <- html_table(draft_table)[[3]]
                   
      })
    }, error = function(e) print(URL)
    )               
    
    complete <- do.call(rbind, dfList) 
            
  }
}

事实上，还有另一个页面存在（因此应该返回数据），即另外22个观测：

https://www.pro-football-reference.com/boxscores/201909080car.htm

上面已经包括了另一个团队，他们试图测试错误并跳过它

我在上述问题上做错了什么？

下面是一个使用

purrr

和其他一些

tidyverse

方法的任务解决方案。见下面的解释

library(tidyverse)

combos <- expand.grid(team = teamsVector, date = datesVector)
urls <- paste0(URL.base, combos$date, combos$team, URL.end)

output <- urls %>% 
  map(possibly(read_html, otherwise = NA_character_)) %>%
  discard(is.character) %>%
  map(function(html) html_nodes(html, 'table') %>% html_table) %>%
  pluck(3) %>%
  map_dfr(function(data) {
    col_lvl1 <- colnames(data)
    col_lvl2 <- data[1, ] %>% unname
    actual_data <- data[2:nrow(data), ] %>% unname
    col_multi_lvl <- paste(col_lvl1, col_lvl2, sep = "_")
    colnames(actual_data) <- col_multi_lvl
    return(actual_data %>% as_tibble(.name_repair = make.names))
  }) %>% 
  filter(str_detect(Passing_Cmp, "Passing|Cmp", negate = TRUE)) %>%
  mutate_at(vars(-starts_with("X_")), as.numeric)

然后使用

map（）

，将其输入到一个

包装的read\u html（）
：
坏的URL将产生类型为character
的NA
值，因此我们可以discard（）
返回字符项（其他所有内容都是一些HTML对象）：
这就给我们留下了一个有效的抓取网页列表，我们可以再次使用map（）
遍历该列表并提取表数据。pluck（）
函数是一种很好的列表子集设置方法，可以获得所需的[[3]]
表索引：
map(function(html) html_nodes(html, 'table') %>% html_table) %>%
  pluck(3) %>%

这些表有一个棘手的问题，就是它们有多个索引列——最上面的列包含“传递”或“接收”之类的类别，而子列则指定一个统计信息（“TD”、“Int”等）。对于R来说，将其放入数据帧并不容易，因为它需要一组列名，而不是两组。这需要一些黑客技术。我选择拉出列名和子列名，并将它们与下划线组合（例如“Passing_TD”）：
我不太清楚为什么要使用lappy
，因为你设置URL
向量的方式总是长度为1，所以只有一件事要看。另一个问题似乎在你的循环中，因为你一直在用迭代的结果替换complete
的值，而不是附加到之前的结果。好的，我还在学习，你能提供一个关于如何设置的提示吗？为什么URL
向量的长度总是为1？哇，这太棒了！我很感激！这个解释也很有帮助，谢谢。但是，我在2个csv文件（dates.csv和teams.csv）中有许多日期和团队，当我使用这些文件时，会出现以下错误：error:Problem with
filter（）`input.1
。在输入时未找到x对象“Passing\u Cmp”。1
是str\u detect（Passing\u Cmp，“Passing\u Cmp”，negate=TRUE）
。运行rlang:：last_error（）查看错误发生的位置。`是否还有一种方法可以包括周号和季节号。因为我想跨多个季节运行此操作？我不认为这些出现在表格中，所以不确定它是否可以完成？嗨，在你的帖子中，你把问题框定为直接从URL，而不是从CSV。这是一种完全不同的方法。您能否确认此代码运行时没有错误，并返回您期望从这些网页中获得的数据？那是第一个开始。至于你的第二个问题，我不知道。解决任何问题的最好方法就是一步一个脚印。最好是打开一个有单独问题的新问题，而不是尝试在同一个问题中添加注释。是的，它运行时没有错误，并返回我期望的数据。上面我确实说过，我将有一个很长的向量，包含多个日期和团队，我本应该更清楚这一点。我将分别考虑第二个问题。
output %>% head(15)
# A tibble: 15 x 22
   X_Player X_Tm  Passing_Cmp Passing_Att Passing_Yds...5 Passing_TD Passing_Int Passing_Sk Passing_Yds...9
   <chr>    <chr>       <dbl>       <dbl>           <dbl>      <dbl>       <dbl>      <dbl>           <dbl>
 1 Matthew… DET            27          45             385          3           0          3              24
 2 Kerryon… DET             0           0               0          0           0          0               0
 3 C.J. An… DET             0           0               0          0           0          0               0
 4 Ty John… DET             0           0               0          0           0          0               0
 5 Marvin … DET             0           0               0          0           0          0               0
 6 T.J. Ho… DET             0           0               0          0           0          0               0
 7 Danny A… DET             0           0               0          0           0          0               0
 8 Kenny G… DET             0           0               0          0           0          0               0
 9 J.D. Mc… DET             0           0               0          0           0          0               0
10 Jesse J… DET             0           0               0          0           0          0               0
11 Nick Ba… DET             0           0               0          0           0          0               0
12 Kyler M… ARI            29          54             308          2           1          5              33
13 David J… ARI             0           0               0          0           0          0               0
14 Christi… ARI             0           0               0          0           0          0               0
15 Chase E… ARI             0           0               0          0           0          0               0
# … with 13 more variables: Passing_Lng <dbl>, Passing_Rate <dbl>, Rushing_Att <dbl>, Rushing_Yds <dbl>,
#   Rushing_TD <dbl>, Rushing_Lng <dbl>, Receiving_Tgt <dbl>, Receiving_Rec <dbl>, Receiving_Yds <dbl>,
#   Receiving_TD <dbl>, Receiving_Lng <dbl>, Fumbles_Fmb <dbl>, Fumbles_FL <dbl>

combos <- expand.grid(team=teamsVector, date=datesVector)
urls <- paste0(URL.base, combos$date, combos$team, URL.end)

urls
[1] "http://www.pro-football-reference.com/boxscores/201909080chi.htm"
[2] "http://www.pro-football-reference.com/boxscores/201909080crd.htm"
[3] "http://www.pro-football-reference.com/boxscores/201909080car.htm"
[4] "http://www.pro-football-reference.com/boxscores/201909050chi.htm"
[5] "http://www.pro-football-reference.com/boxscores/201909050crd.htm"
[6] "http://www.pro-football-reference.com/boxscores/201909050car.htm"

urls %>% 
  map(possibly(read_html, otherwise = NA_character_)) # throw NA if bad URL

discard(is.character) %>%

map(function(html) html_nodes(html, 'table') %>% html_table) %>%
  pluck(3) %>%

# note: map_dfr just specifies a data frame output, joined row-wise
map_dfr(function(data) {
    col_lvl1 <- colnames(data)
    col_lvl2 <- data[1, ] %>% unname
    actual_data <- data[2:nrow(data), ] %>% unname
    col_multi_lvl <- paste(col_lvl1, col_lvl2, sep = "_")
    colnames(actual_data) <- col_multi_lvl
    return(actual_data %>% as_tibble(.name_repair = make.names))
  }) %>% 

  filter(str_detect(Passing_Cmp, "Passing|Cmp", negate = TRUE)) %>%
  mutate_at(vars(-starts_with("X_")), as.numeric)