Scraping Javascript-rendered content in R from a webpage without a unique URL
I want to scrape the historical results of South African LOTTO draws (in particular total pool size, total sales, etc.) from the lottery website. By default, links to the results of the ten most recent draws are shown; one can also select a date range to pull up a larger set of draw links (still only ten per page).

Hovering over one of the links in the browser, e.g. "LOTTO DRAW 2012", shows javascript:void(), so it is clear that the draw results are rendered using Javascript. Following the advice in an article I read, I realized I needed to open Google Chrome Developer Tools, go to the Network tab, and then click the link for the draw "LOTTO DRAW 2012". When I did so, I could see the request that the click triggers.
When I right-click the initiator and select "Copy response", I can see the data I need inside a "drawDetails" object, in what looks like JSON:
{"code":200,"message":"OK","data":{"drawDetails":{"drawNumber":"2012","drawDate":"2020\/04\/11","nextDrawDate":"2020\/04\/15","ball1":"48","ball2":"6","ball3":"43","ball4":"41","ball5":"25","ball6":"45","bonusBall":"38","div1Winners":"1","div1Payout":"10546013.8","div2Winners":"0","div2Payout":"0","div3Winners":"28","div3Payout":"7676.4","div4Winners":"62","div4Payout":"2751.4","div5Winners":"1389","div5Payout":"206.3","div6Winners":"1872","div6Payout":"133","div7Winners":"28003","div7Payout":"50","div8Winners":"20651","div8Payout":"20","rolloverAmount":"0","rolloverNumber":"0","totalPrizePool":"13280236.5","totalSales":"11610950","estimatedJackpot":"2000000","guaranteedJackpot":"0","drawMachine":"RNG2","ballSet":"RNG","status":"published","winners":52006,"millionairs":1,"gpwinners":"52006","wcwinners":"0","ncwinners":"0","ecwinners":"0","mpwinners":"0","lpwinners":"0","fswinners":"0","kznwinners":"0","nwwinners":"0"},"totalWinnerRecord":{"lottoMillionairs":28716702,"lottoWinners":337285646,"ithubaMillionairs":135763,"ithubaWinners":305615802}},"videoData":[{"id":"1049","listid":"1","parentid":"1","videosource":"youtube","videoid":"chHfFxVi9QI","imageurl":"","title":"LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW 2012 (11 APRIL 2020)","description":"","custom_imageurl":"","custom_title":"","custom_description":"","specialparams":"","lastupdate":"0000-00-00 00:00:00","allowupdates":"1","status":"0","isvideo":"1","link":"https:\/\/www.youtube.com\/watch?v=chHfFxVi9QI","ordering":"10001","publisheddate":"2020-04-11 
20:06:17","duration":"182","rating_average":"0","rating_max":"0","rating_min":"0","rating_numRaters":"0","statistics_favoriteCount":"0","statistics_viewCount":"329","keywords":"","startsecond":"0","endsecond":"0","likes":"6","dislikes":"0","commentcount":"0","channel_username":"","channel_title":"","channel_subscribers":"9880","channel_subscribed":"0","channel_location":"","channel_commentcount":"0","channel_viewcount":"0","channel_videocount":"1061","channel_description":"","channel_totaluploadviews":"0","alias":"lotto-lotto-plus-1-and-lotto-plus-2-draw-2012-11-april-2020","rawdata":"","datalink":"https:\/\/www.googleapis.com\/youtube\/v3\/videos?id=chHfFxVi9QI&part=id,snippet,contentDetails,statistics&key=AIzaSyC1Xvk2GUdb_N3UiFtjsgZ-uMviJ_8MFZI"}]}
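This is a POST request, so I tried to follow along, but I could not find an onclick value representing the data submitted with the form. Moreover, the request URL for "LOTTO DRAW 2012" is identical to the request URL for "LOTTO DRAW 2011", so the URL itself carries no unique identifier for the specific draw. It is therefore unclear to me how to make a distinct request for the results of a particular draw.

So the smaller question is: given a specific draw number or draw date, how does one find the unique identifier used to make the POST request for that draw's data?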
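The bigger question is: assuming one can obtain the unique identifiers for all historical draws, how does one generate the JSON drawDetails object for each of them in turn, or otherwise complete the scrape?

You're right: the content on the page is updated by Javascript via an ajax request. The server returns a JSON string in response to an HTTP POST request. With a POST request, the server's response depends not only on the URL you request but also on the body of the message you send to the server. In this case, the body is a simple form with 3 fields: gameName, which is always LOTTO; isAjax, which is always true; and drawNumber, which is the field you want to vary.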
If you are using httr, you specify these fields as a named list in the body argument of the POST function.
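As a rough illustration of the named-list body, here is a minimal sketch. The helper names draw_form() and fetch_draw() are made up for this example, and the URL is assumed to be the one quoted in the second answer further down; encode = "form" asks httr to send the list as a URL-encoded form.

```r
# Sketch only: draw_form() and fetch_draw() are hypothetical helper names.
# The URL is assumed to be the endpoint quoted later in this thread.
lotto_url <- paste0(
  "https://www.nationallottery.co.za/index.php",
  "?task=results.redirectPageURL&Itemid=265",
  "&option=com_weaver&controller=lotto-history"
)

# Build the three form fields; only drawNumber changes between draws.
draw_form <- function(draw_number) {
  list(
    gameName   = "LOTTO",
    isAjax     = "true",
    drawNumber = as.character(draw_number)
  )
}

# Send the POST request with a form-encoded body and return the raw response.
fetch_draw <- function(draw_number, url = lotto_url) {
  httr::POST(url, body = draw_form(draw_number), encode = "form")
}
```

Calling fetch_draw(2012) should then return an httr response object whose body is the JSON string for draw 2012, provided the site still behaves as described here.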
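Once you have the response for each draw, you will want to parse the JSON into an R-friendly format, such as a list or data frame, using a library such as jsonlite. Given the structure of this particular JSON, it makes most sense to extract the component $data$drawDetails and turn it into a one-row data frame. That lets you bind several draws together into a single data frame.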
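The parsing and binding steps above can be sketched as follows. The function names are invented for illustration, and the sketch assumes every field in drawDetails is present and is a single value, as in the sample response shown in the question.

```r
# Sketch only: parse_draw(), details_to_row() and bind_draws() are
# hypothetical names, not part of httr or jsonlite.

# Turn one JSON response body (a character string) into a one-row data frame.
parse_draw <- function(json_text) {
  parsed <- jsonlite::fromJSON(json_text)
  details_to_row(parsed$data$drawDetails)
}

# drawDetails is assumed to be a flat named list of single values,
# so it maps directly onto one row with one column per field.
details_to_row <- function(details) {
  as.data.frame(details, stringsAsFactors = FALSE)
}

# Bind several one-row data frames (same columns) into one data frame.
bind_draws <- function(rows) {
  do.call(rbind, rows)
}
```

With a list of httr responses in hand, something like bind_draws(lapply(responses, function(r) parse_draw(httr::content(r, as = "text", encoding = "UTF-8")))) would give one row per draw. Most values arrive as character strings, so numeric columns still need converting afterwards.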
A function that does all of this, lotto_details, returns the following when called for draws 2009 to 2012:

#>   drawNumber   drawDate nextDrawDate ball1 ball2 ball3 ball4 ball5 ball6
#> 1       2009 2020/04/01   2020/04/04    51    15     7    32    42    45
#> 2       2010 2020/04/04   2020/04/08    43     4    21    24    10     3
#> 3       2011 2020/04/08   2020/04/11    42    43     8    18     2    29
#> 4       2012 2020/04/11   2020/04/15    48     6    43    41    25    45
#>   bonusBall div1Winners div1Payout div2Winners div2Payout div3Winners
#> 1         1           0          0           0          0          21
#> 2        22           0          0           0          0          31
#> 3        34           0          0           0          0          21
#> 4        38           1 10546013.8           0          0          28
#>   div3Payout div4Winners div4Payout div5Winners div5Payout div6Winners
#> 1     8455.3          60     2348.7        1252        189        1786
#> 2     6004.3          71     2080.6        1808      137.3        2352
#> 3     8584.5          60     2384.6        1405      171.1        2079
#> 4     7676.4          62     2751.4        1389      206.3        1872
#>   div6Payout div7Winners div7Payout div8Winners div8Payout rolloverAmount
#> 1      115.2       24664         50       19711         20     3809758.17
#> 2       91.7       35790         50       25981         20     5966533.86
#> 3      100.5       27674         50       21895         20     8055430.87
#> 4        133       28003         50       20651         20              0
#>   rolloverNumber totalPrizePool totalSales estimatedJackpot
#> 1              2     6198036.67    9879655          6000000
#> 2              3     9073426.56   11696905          8000000
#> 3              4    10649716.37   10406895         10000000
#> 4              0     13280236.5   11610950          2000000
#>   guaranteedJackpot drawMachine ballSet    status winners millionairs
#> 1                 0        RNG2     RNG published   47494           0
#> 2                 0        RNG2     RNG published   66033           0
#> 3                 0        RNG2     RNG published   53134           0
#> 4                 0        RNG2     RNG published   52006           1
#>   gpwinners wcwinners ncwinners ecwinners mpwinners lpwinners fswinners
#> 1     47494         0         0         0         0         0         0
#> 2     66033         0         0         0         0         0         0
#> 3     53134         0         0         0         0         0         0
#> 4     52006         0         0         0         0         0         0
#>   kznwinners nwwinners
#> 1          0         0
#> 2          0         0
#> 3          0         0
#> 4          0         0
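Created on 2020-04-13 (v0.3.0)

The question already has a satisfactory answer (see above), which I have accepted. I arrived at an almost identical solution at the same time; I add it here only because it explicitly covers all available draw numbers and automatically detects the latest draw number.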
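# Endpoint that serves both the history page and the per-draw JSON.
theurl <- "https://www.nationallottery.co.za/index.php?task=results.redirectPageURL&Itemid=265&option=com_weaver&controller=lotto-history"

# Read the page text and pick out the 4-digit draw numbers that follow the marker string.
x <- rvest::html_text(xml2::read_html(theurl))
preceding_string <- "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW "
starts <- gregexpr(preceding_string, x, fixed = TRUE)[[1]] + nchar(preceding_string)
drawnums <- as.integer(vapply(starts, function(k) substr(x, start = k, stop = k + 3), NA_character_))

# Request every draw from 1506 up to the latest draw listed on the page.
drawnumrange <- 1506:max(drawnums)
response <- lapply(drawnumrange, function(d) httr::POST(url = theurl,
  body = list(gameName = "LOTTO", drawNumber = as.character(d), isAjax = "true"),
  encode = "form"))

# Parse each response body as JSON and keep only the drawDetails component.
jsondat <- lapply(response, function(r)
  jsonlite::parse_json(httr::content(r, as = "text", encoding = "UTF-8"))$data$drawDetails)
lottotable <- as.data.frame(do.call(rbind, jsondat))

# Convert the numeric columns, then write the first 37 columns to an Excel file.
numericcols <- c(1, 4:32, 36:37)
lottotable[numericcols] <- lapply(lottotable[numericcols], as.numeric)
xlsx::write.xlsx2(lottotable[1:37], "lottotable.xlsx", row.names = FALSE)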
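The draw-number detection above reads a fixed four characters after the marker string. As a possible variation, and only a sketch, a regular expression can capture however many digits follow the marker. The function name extract_draw_numbers() is made up for this example, and it assumes the page text has the same "... DRAW <number>" shape as the video titles in the sample JSON.

```r
# Sketch only: extract_draw_numbers() is a hypothetical helper.
# It returns every integer that directly follows the marker text.
extract_draw_numbers <- function(page_text,
                                 marker = "LOTTO, LOTTO PLUS 1 AND LOTTO PLUS 2 DRAW ") {
  # Match the marker followed by one or more digits.
  pattern <- paste0(marker, "[0-9]+")
  hits <- regmatches(page_text, gregexpr(pattern, page_text))[[1]]
  # Strip the marker so that only the digits remain, then convert.
  as.integer(sub(marker, "", hits, fixed = TRUE))
}
```

The latest draw number would then be max(extract_draw_numbers(x)), with x being the page text read earlier.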