Web scraping: select fields from drop-down lists, then extract the resulting data

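I'm trying to do some web scraping in R and could use some help.

I want to extract the data from the table on this page.

Before scraping, though, I want to select County in the leftmost drop-down list and then Alameda County (CA) in the next drop-down list, so that the table shows the data for that county.

Here is what I have so far. I think I know why it doesn't work: the rvest form functions are meant for filling in basic forms, not for choosing from drop-down lists on an .aspx page. I have searched for an example of what I'm trying to do and come up empty.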

library(rvest)
url       <-"http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx"       
pgsession <-html_session(url)               
pgform    <-html_form(pgsession)[[1]]       

filled_form <- set_values(pgform,
                      `#atype_chosen span` = "County", 
                      `#asel_chosen span` = "Alameda Count (CA)") 
submit_form(pgsession,filled_form)
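This gives me an error: Unknown field names: #atype_chosen span, #asel_chosen span. I sort of understand why. I'm asking R to put County into the box without opening the drop-down list, and that doesn't work.

If anyone can point me in the right direction, I'd really appreciate it.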

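I monitored the requests the browser makes when you select a county and used that information to write the code below. It gets you the data, just by a different route than the one you were attempting. The area parameter in the payload is what changes from county to county.

Update: I've added code that fetches the list of counties and their codes, so you can pick whichever county you want data for.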

library("httr")

# start by getting the counties and their codes...
url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnAOI"
# Headers copied from the browser's request. Content-Length is left out
# because curl computes it from the body.
headers <- add_headers(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.8",
  "Content-Type" = "application/json; charset=UTF-8",
  "Host" = "droughtmonitor.unl.edu",
  "Origin" = "http://droughtmonitor.unl.edu",
  "Proxy-Connection" = "keep-alive",
  "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
a <- POST(url, body="{'aoi':'county'}", headers, encode="json")
tmp <- content(a)[[1]]
county_df <- data.frame(text=unname(unlist(sapply(tmp, "[", "Text"))),
                  value=unname(unlist(sapply(tmp, "[", "Value"))),
                  stringsAsFactors=FALSE)

# use the code for whatever county you want in the payload below...

url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM"
payload <- "{'area':'06001', 'type':'county', 'statstype':'1'}"
# Same browser headers as above; the hard-coded Content-Length and the
# duplicated X-Requested-With entry have been dropped.
headers <- add_headers(
  "Host" = "droughtmonitor.unl.edu",
  "Proxy-Connection" = "keep-alive",
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Origin" = "http://droughtmonitor.unl.edu",
  "X-Requested-With" = "XMLHttpRequest",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
  "Content-Type" = "application/json; charset=UTF-8",
  "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.8"
)
a <- POST(url, body=payload, headers, encode="json")
tmp <- content(a)[[1]]
df <- data.frame(date=unname(unlist(sapply(tmp, "[", "Date"))),
                 d0=unname(unlist(sapply(tmp, "[", "D0"))),
                 d1=unname(unlist(sapply(tmp, "[", "D1"))),
                 d2=unname(unlist(sapply(tmp, "[", "D2"))),
                 d3=unname(unlist(sapply(tmp, "[", "D3"))),
                 d4=unname(unlist(sapply(tmp, "[", "D4"))),
                 stringsAsFactors=FALSE)
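
To get from a county name to the payload without copying codes by hand, a pair of small helpers along these lines might work. This is only a sketch: county_code and make_payload are hypothetical names, not part of httr, and the lookup assumes county_df has the character columns text and value built above.

```r
# Hypothetical helper: return the code of the single county whose label
# contains `pattern`. Assumes `counties` has character columns `text`
# and `value`, like county_df above.
county_code <- function(counties, pattern) {
  hits <- counties$value[grepl(pattern, counties$text, fixed = TRUE)]
  if (length(hits) != 1) {
    stop("expected exactly one match for '", pattern, "', found ", length(hits))
  }
  hits
}

# Hypothetical helper: build the request body in the same single-quoted
# format as the hand-written payload string above.
make_payload <- function(area, type = "county", statstype = "1") {
  sprintf("{'area':'%s', 'type':'%s', 'statstype':'%s'}", area, type, statstype)
}
```

With these defined, payload <- make_payload(county_code(county_df, "Alameda")) should produce the same string as the hard-coded payload, provided the county list contains exactly one label matching "Alameda".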

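There's a tear in my eye... it's so... beautiful. Great stuff, @cory. Thank you very much.

Hey Cory! This answer looks great, but I have no idea what it means. Do you have any resources that would help a web-scraping beginner understand the code you wrote above? @cory

@William In your browser, press F12 -> Network tab -> look at the requests the site makes. Use that request information in the R code. You'll have to do some research on POST and GET requests to understand most of it.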
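
For anyone in William's position, the least obvious step in the answer is probably how the parsed response becomes a data frame. content(a)[[1]] is treated there as a list of records, each a named list, and every sapply(tmp, "[", ...) call pulls one field out of every record. The sketch below wraps that pattern in one function. records_to_df is a hypothetical name, and it assumes every record contains every requested field with a non-NULL value.

```r
# Hypothetical helper: turn a list of records (each a named list) into a
# data frame with one column per requested field.
# Assumes each record has every field in `fields`, none of them NULL.
records_to_df <- function(records, fields) {
  cols <- lapply(fields, function(f) {
    # extract field `f` from each record and flatten to a plain vector
    unlist(lapply(records, function(rec) rec[[f]]))
  })
  names(cols) <- tolower(fields)  # "Date" -> "date", "D0" -> "d0", ...
  as.data.frame(cols, stringsAsFactors = FALSE)
}
```

Assuming the response has the shape the answer relies on, df <- records_to_df(content(a)[[1]], c("Date", "D0", "D1", "D2", "D3", "D4")) should give the same columns as the longer data.frame call.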