R: web scraping loop with multiple problems


r web-scraping


I have a problem with my R code, which downloads box scores from a website:

for (i in Sites) {
  try({log("a")}, silent=TRUE)
  webpage_url <- i
  webpage <- xml2::read_html(webpage_url)

  table <- rvest::html_table(webpage, fill=TRUE)[[1]]
}
#Here's an example url
"https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007310COL"

You can store the data from each URL in a list:
extract_table <- function(webpage_url) {
  webpage <- xml2::read_html(webpage_url)  
  rvest::html_table(webpage, fill=TRUE)[[1]] 
}

list_data <- lapply(Sites, extract_table)
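
The question wraps an unrelated expression in try(), so a failing read_html() call would still stop the loop. As a rough sketch of one way to keep going when a single URL fails, the helper below wraps the fetch step in tryCatch(). The name scrape_all and its fetch argument are hypothetical and not part of the original answer; extract_table is repeated from above so the block stands alone.

```r
# Hypothetical helper: apply `fetch` to every URL, returning NULL for any
# URL that raises an error instead of aborting the whole run.
scrape_all <- function(urls, fetch) {
  results <- lapply(urls, function(u) {
    tryCatch(
      fetch(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- urls  # keep track of which table came from which URL
  results
}

# Same extraction step as in the answer above
extract_table <- function(webpage_url) {
  webpage <- xml2::read_html(webpage_url)
  rvest::html_table(webpage, fill = TRUE)[[1]]
}

# Usage (needs network access; Sites is assumed to be a character vector):
# list_data <- scrape_all(Sites, extract_table)
# list_data <- Filter(Negate(is.null), list_data)  # drop the failed URLs
```

Passing the fetch function as an argument is only a convenience here: it lets the error handling be tried out with a stub before hitting the real site.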

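1. What exactly does Sites contain? Does it contain the whole URL (https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007310COL) or only the ID at the end (202007310COL)? 2. html_table returns 5 tables on that page; which data do you want to extract? What should your output look like?

It contains the whole URL. With library(xml2) and webpage_url I get an "x must be a string of length 1" error. Do you know how to fix this? I tried to find a fix for this error but found no solution that works.

What does dput(head(Sites)) return?

structure(list(ï..https...www.baseball.almanac.com.box.scores.boxscore.php.boxid.202007230COL = c("https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007240COL", "https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007250COL", "https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007260COL", "https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007270COL", "https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007290COL")), row.names = c(NA, 6L), class = "data.frame")

It seems you read the data incorrectly: the first URL has ended up as the column name. How did you read the data? You need to include header=FALSE when reading it. Also, these are not complete URLs, because they don't have 'https' at the start. After fixing the above errors, you might need to use Sites[[1]], so that a character vector rather than a data frame is passed on.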
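
The advice in the last comment can be sketched as follows. This is only an illustration under stated assumptions: the URLs are assumed to sit one per line in a CSV file with no header row, a temporary file stands in for the asker's real file (whose name is not given), and fileEncoding = "UTF-8-BOM" is an assumption based on the "ï.." prefix in the column name, which usually points to a byte-order mark.

```r
# Stand-in for the asker's file: one URL per line, no header row.
urls_file <- tempfile(fileext = ".csv")
writeLines(
  c("https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007230COL",
    "https://www.baseball-almanac.com/box-scores/boxscore.php?boxid=202007240COL"),
  urls_file
)

# header = FALSE keeps the first URL as data instead of turning it into a
# column name; "UTF-8-BOM" (an assumption about the real file) drops a
# leading byte-order mark if one is present.
sites_df <- read.csv(urls_file, header = FALSE, stringsAsFactors = FALSE,
                     fileEncoding = "UTF-8-BOM")

# read_html() expects a single string, so pull the column out as a
# character vector before looping or calling lapply().
Sites <- sites_df[[1]]
```

With Sites as a plain character vector, each element passed to read_html() is a string of length 1, which is what the error message asks for.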
To combine the tables into a single data frame:
data <- do.call(rbind, list_data)
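
Once the tables are stacked, it is no longer obvious which rows came from which page. A minimal sketch of one way to keep that information is to add an identifying column before binding. The toy tables, the column names, and the ids vector below are made up for illustration and are not real box-score data. Note that rbind only works if every table has the same columns.

```r
# Toy stand-ins for the scraped tables (made-up values).
list_data <- list(
  data.frame(player = c("A", "B"), hits = c(1L, 2L), stringsAsFactors = FALSE),
  data.frame(player = "C", hits = 0L, stringsAsFactors = FALSE)
)
ids <- c("game1", "game2")  # hypothetical labels, e.g. the boxid of each URL

# Prepend the label to each table, then stack them row-wise.
tagged <- Map(
  function(df, id) data.frame(game = id, df, stringsAsFactors = FALSE),
  list_data, ids
)
data <- do.call(rbind, tagged)
```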