Using R to accept cookies to download a PDF file


I'm getting stuck on cookies when trying to download a PDF.

For example, if I have a PDF document on the Archaeology Data Service, its DOI resolves to a landing page, but the actual download redirects to another link.

library(httr) will handle resolving the DOI, and we can use library(XML) to extract the PDF URL from the landing page, but I'm stuck on getting the PDF itself.

If I do this:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")
then I receive an HTML file rather than the PDF; it appears to be the same page that asks you to accept the site's terms.
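
A quick way to confirm what actually came back (my own check, not from the original post): a real PDF starts with the bytes "%PDF", while an HTML page starts with something like "<!DO" or "<htm".

first_bytes <- readBin("tmp.pdf", "raw", n = 4)
rawToChar(first_bytes)  # "%PDF" for a real PDF; "<!DO" / "<htm" for an HTML page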

Trying an existing answer, I have:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
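
To see why this still fails, it helps to inspect what the server actually returned; this diagnostic is my addition, not part of the original attempt:

http_type(resp)  # "text/html" if the cookies did not stick, "application/pdf" on success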
It seems that the cookie situation with this website is complicated. This kind of cookie complexity doesn't appear to be unusual for UK data providers.

How can I get past the cookies on this website using R?

Your request has been heard!

There is a lot of javascript between those pages, which makes trying to decipher them via httr + rvest somewhat annoying. Try RSelenium. The following worked on OS X 10.11.2, R 3.2.3 and Firefox:

library(RSelenium)

# check if a sever is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))
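
The download happens asynchronously inside the browser, so it can help to wait until the file actually appears before moving on. A minimal polling sketch (my addition; the file name is assumed from the URL):

pdf_path <- file.path(getwd(), "GL44004.pdf")
for (i in 1:60) {               # wait up to 60 seconds
  if (file.exists(pdf_path)) break
  Sys.sleep(1)
}
file.exists(pdf_path)           # TRUE once the download has landed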
This answer came by email and is posted here at the sender's request:

This will allow you to download the PDF:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")
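
A quick sanity check (illustrative, not from the email): getBinaryURL() returns a raw vector, and the first four bytes of a valid PDF spell "%PDF".

rawToChar(pdfData[1:4])  # should print "%PDF"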

Comments on the RSelenium answer:

Tried on Ubuntu 14.04, R 3.2.3 and Firefox. dr$open() reports:

[1] "Connecting to remote server"
Undefined error in RCurl call. Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) :

That has been my biggest gripe with Selenium (not necessarily the R package): it is really difficult to get consistent behaviour across Windows, OS X and *nix. Hopefully folks can add to this (all of my *nix systems are headless servers with very minimal configurations, and I did not feel like wrangling the phantomjs driver tonight :-)

OK, found out how to get it working on my machine. I had to start the Selenium standalone server manually first with

java -jar selenium-server-standalone-2.48.0.jar

and then I was able to connect. That took more effort than expected (the initial profile settings did not work, but the ones above did). You may need to quote the directory path more carefully given the crazy Windows slashes, but I can confirm that the above works on two Macs.

I can confirm this works on my Ubuntu as well, provided you skip the checkForServer() step (which tries to download the standalone server) and the server is already running after java -jar selenium-server-standalone-2.48.0.jar.
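
For reference, a sketch of connecting to a standalone server that was started by hand (my assumption of the usual RSelenium pattern; localhost:4444 is the server's default address):

library(RSelenium)
# prefs is the Firefox profile defined in the answer above
dr <- remoteDriver$new(remoteServerAddr = "localhost", port = 4444L,
                       extraCapabilities = prefs)
dr$open()
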
# close the session
dr$close()
appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")
appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)

# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed

doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x){data.frame(name = x[["name"]], value = x[["value"]]
                                                    , stringsAsFactors = FALSE)})
postData <- do.call(rbind, postData)

# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]

searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
                       "j_id10" = "j_id10",
                       from = postData[postData$name == "from", "value"],
                       "javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
                       "j_id10:_idcl" = "j_id10:agreeButton"
                       , binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)


Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some javascript.
This POST then redirects to the page with the PDF, having set some additional cookies.

Checking our cookies we see:

> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e" 
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"          

So it would probably be sufficient to set that last cookie from the start (indicating that we accept the copyright terms).
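
As an illustration of that shortcut, a minimal httr sketch (my addition, assuming the ADSCOPYRIGHT cookie is all the server checks before serving the file):

library(httr)
appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
# send the copyright cookie up front instead of submitting the ACCEPT form
resp <- GET(appURL, set_cookies(ADSCOPYRIGHT = "YES"))
writeBin(content(resp, "raw"), "GL44004.pdf")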