How to log in to an html/aspx web page and extract a table using R
I am trying to extract a table from a web page. The page is in Chinese, but basically you type your login details into the boxes above the big blue button in the middle of the page. After you log in, the table appears in the middle of the page. Note: on /articlenew.html, logging in requires only a username and a password, nothing else.
After authentication, the request and response headers look like this:
Request URL:http://www.sxcoal.com/user/login.aspx
Request Method:POST
Status Code:302 Found
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en,en-GB;q=0.8,zh;q=0.6,zh-CN;q=0.4
Connection:keep-alive
Content-Length:39
Content-Type:application/x-www-form-urlencoded
Cookie:the_cookies
Host:www.sxcoal.com
Origin:http://www.sxcoal.com
Referer:http://www.sxcoal.com/coal/3478186/articlenew.html
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Form Data
username:myusername
password:mypassword
Response Headers
Cache-Control:private
Content-Length:167
Content-Type:text/html; charset=gb2312
Date:Thu, 14 Nov 2013 01:06:00 GMT
Location:http://www.sxcoal.com/coal/3478186/articlenew.html
Server:Microsoft-IIS/7.0
Set-Cookie:s_info=zhuhaiqinfa|15816; domain=sxcoal.com; path=/
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET
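The capture above suggests that a POST carrying only `username` and `password` to /user/login.aspx, with the article page as Referer, may be enough. A minimal R sketch under that assumption follows. `build_login_fields` and `login_and_read_tables` are hypothetical helper names, and it is not known from the capture whether the server also checks other cookies or headers. The page is served as gb2312, so the returned text may additionally need an encoding conversion.

```r
# Form fields exactly as they appear in the captured "Form Data" section.
build_login_fields <- function(user, pass) {
  list(username = user, password = pass)
}

# Hypothetical helper: log in with one curl handle, then fetch the article
# page with the same handle so that the session cookie is sent along.
# Requires the RCurl and XML packages; nothing runs until it is called.
login_and_read_tables <- function(user, pass,
    page  = "http://www.sxcoal.com/coal/3478186/articlenew.html",
    login = "http://www.sxcoal.com/user/login.aspx") {
  h <- RCurl::getCurlHandle(cookiefile = "",        # switch on cookie handling
                            followlocation = TRUE,  # follow the 302 redirect
                            useragent = "Mozilla/5.0",
                            referer = page)
  # style = "POST" sends application/x-www-form-urlencoded, matching the
  # captured Content-Type (postForm's default style is multipart/form-data).
  RCurl::postForm(login, .params = build_login_fields(user, pass),
                  style = "POST", curl = h)
  html <- RCurl::getURL(page, curl = h)
  # Returns a list with one data frame per <table> found in the page.
  XML::readHTMLTable(XML::htmlParse(html, asText = TRUE))
}
```

Calling `login_and_read_tables("myusername", "mypassword")` would then return every table on the page, and the one of interest still has to be picked out by inspection.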
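I have tried the method shown here, but for some reason R cannot log in. My guess is that the problem lies with /login.aspx: http:[DELETE]//www.[DELETE]sxcoal.[DELETE]com/user/login.[DELETE]aspx
[Sorry, I don't have enough "reputation" to post more links.] I have put the headers for /login.aspx at the end of the question.
This is the code I used: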
library(RCurl)
mycurl <- getCurlHandle()
agent <- "Mozilla/5.0"
curlSetOpt(cookiejar = "", followlocation = TRUE, useragent = agent, autoreferer = TRUE, curl = mycurl)
html <- getURL('http://www.sxcoal.com/user/login.aspx', curl = mycurl)
viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
eventvalidation <- as.character(sub('.*id="__EVENTVALIDATION" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
## checkcode <- ???  ## can't define it, as it changes on every page load
params <- list(
  "txtuser"           = "myusername",
  "txtpass"           = "mypassword",
  "__VIEWSTATE"       = viewstate,
  "__EVENTVALIDATION" = eventvalidation,
  "CheckCode"         = checkcode,
  "Button2"           = ""
)
html <- postForm('http://www.sxcoal.com/user/login.aspx', .params = params, curl = mycurl)
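One fragile spot in the code above is that `sub()` returns its whole input unchanged when the pattern does not match, so a missing hidden field would silently put the entire page into `viewstate`. The sketch below is one possible alternative that returns `NA` instead. `extract_hidden_field` is a hypothetical helper name, and the sample HTML is made up for illustration. It does nothing about `CheckCode`, which appears to be a verification code that changes with every page load.

```r
# Return the value attribute of the <input> with the given id,
# or NA_character_ when no such field is present in the HTML string.
extract_hidden_field <- function(html, id) {
  pattern <- paste0('id="', id, '"[^>]*value="([^"]*)"')
  m <- regmatches(html, regexec(pattern, html))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Made-up sample markup in the shape ASP.NET WebForms usually emits.
sample_html <- paste0(
  '<form><input type="hidden" name="__VIEWSTATE" ',
  'id="__VIEWSTATE" value="/wEPDwUK+abc=" /></form>'
)

viewstate_demo <- extract_hidden_field(sample_html, "__VIEWSTATE")
missing_demo   <- extract_hidden_field(sample_html, "__EVENTVALIDATION")
```

Checking the result with `is.na()` before posting makes it obvious when the login page did not contain the expected fields.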
Headers for the /login.aspx request:
Request URL:http://www.sxcoal.com/user/login.aspx
Request Method:POST
Status Code:302 Found
Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en,en-GB;q=0.8,zh;q=0.6,zh-CN;q=0.4
Connection:keep-alive
Content-Length:234
Content-Type:application/x-www-form-urlencoded
Cookie:the_cookies
Host:www.sxcoal.com
Origin:http://www.sxcoal.com
Referer:http://www.sxcoal.com/user/login.aspx
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Form Data
__VIEWSTATE:whatever_it_is
txtuser:myusername
txtpass:mypassword
CheckCode:04854
Button2:
__EVENTVALIDATION:whatever_it_is_2
Response Headers
Cache-Control:private
Content-Length:170
Content-Type:text/html; charset=gb2312
Date:Thu, 14 Nov 2013 01:09:57 GMT
Location:http://www.sxcoal.com/?aspxerrorpath=/user/login.aspx
Server:Microsoft-IIS/7.0
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET
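The two captures differ in their Location header. The login from /articlenew.html redirects back to the article, while the /login.aspx request redirects to a URL carrying `aspxerrorpath`, the query parameter ASP.NET custom error pages typically use to name the page that raised an error. A script could tell the two outcomes apart with a check along these lines. `redirect_indicates_error` is a hypothetical helper name, and the Location value is assumed to have been read from the response headers already.

```r
# TRUE when a redirect target carries ASP.NET's aspxerrorpath parameter,
# which suggests the server hit an error while handling the request.
redirect_indicates_error <- function(location) {
  grepl("aspxerrorpath=", location, fixed = TRUE)
}

# Location values copied from the two captures in this question.
ok_location  <- "http://www.sxcoal.com/coal/3478186/articlenew.html"
bad_location <- "http://www.sxcoal.com/?aspxerrorpath=/user/login.aspx"

ok_result  <- redirect_indicates_error(ok_location)
bad_result <- redirect_indicates_error(bad_location)
```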