How do I scrape multiple pages using XML and readHTMLTable?

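I am using the XML package to scrape the Chicago Marathon results into a CSV. The problem is that the site can only display 1,000 runners on a single page, so I have to scrape multiple pages. The script I have written so far works for the first page: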

rm(list=ls())

library(XML)

page_numbers <- 1:1429
urls <- paste(
"http://results.public.chicagomarathon.com/2011/index.php?page", 
page_numbers, 
sep = "="
)

tables <-(for i in page_numbers){
readHTMLTable(urls)
}
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

times <- tables[[which.max(n.rows)]]

Add the page number to each URL:

page_numbers <- 1:1429
urls <- paste(
  "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page", 
  page_numbers, 
  sep = "="
)
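
With one URL per page in hand, the remaining step is to read a table from each URL and stack the results. The following is only a minimal sketch of that loop, not the answerer's code: `build_urls` and `scrape_pages` are hypothetical helper names, the `reader` argument is an assumption added so the loop can be tried without network access, and taking the first table on each page (`[[1]]`) is also an assumption about the site's layout.

```r
# Hypothetical helper: build one URL per page number.
# The base URL is the one used in the snippet above.
build_urls <- function(pages,
                       base = "http://results.public.chicagomarathon.com/2011/index.php?pid=list&page") {
  paste(base, pages, sep = "=")
}

# Hypothetical helper: read one table per URL and stack them into one data frame.
# `reader` is injectable (an assumption of this sketch); by default it takes the
# first table that XML::readHTMLTable finds on the page.
scrape_pages <- function(urls,
                         reader = function(u) XML::readHTMLTable(u, stringsAsFactors = FALSE)[[1]]) {
  tables <- lapply(urls, function(u) {
    # A page that fails to load, or that has no table, yields NULL
    # instead of stopping the whole loop.
    tryCatch(reader(u), error = function(e) NULL)
  })
  tables <- Filter(Negate(is.null), tables)
  if (length(tables) == 0) return(NULL)
  do.call(rbind, tables)
}

# Usage (needs network access and the XML package):
# results <- scrape_pages(build_urls(1:3))
# write.csv(results, "marathon.csv", row.names = FALSE)
```

Using `lapply` rather than a bare `for` loop means every page's table is collected in a list, which `do.call(rbind, ...)` can then combine, provided all pages return the same columns.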

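Here is an approach that works. Your approach failed because you were not describing the entire web page. With a little playing around you can find the correct URL format for each page, after which everything falls into place.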

url1 = 'http://results.public.chicagomarathon.com/2011/index.php?page='
url3 = '&content=list&event=MAR&num_results=25'

# GET TABLE FROM PAGE NUMBER
getPage <- function(page){
  require(XML)
  url = paste(url1, page, url3, sep = "")
  tab = readHTMLTable(url, stringsAsFactors = FALSE)[[1]]
  return(tab)
}

require(plyr)
# for some reason ldply fails, hence the llply + rbind workaround
pages    = llply(1:10, getPage, .progress = 'text') 
marathon = do.call('rbind', pages)

Error in tables[[which.max(n.rows)]]: attempt to select less than one element