Web scraping with R: can't access certain nodes
Tags: html, r, web-scraping, rvest
I have a large number of water take permits that are available online, and I want to extract some data from them. For example:
url <- "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1"
Using CSS selectors, I can get the page heading:
url %>%
read_html() %>%
html_nodes(css = "#main") %>%
html_nodes(css = "div") %>%
html_nodes(css = "h1") %>%
html_text()
[1] "Details for CRC000002.1"
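So far so good, but the information I actually want is buried deeper and I can't seem to get to it. For example, the Client Name field ("Killermont Station Limited" in this case) has the following xpath: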
clientxpath <- '//*[@id="main"]/div/div[1]/div/table/tbody/tr[1]/td[2]'
url %>%
read_html() %>%
html_nodes(xpath = clientxpath) %>%
html_text()
character(0)
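But that gives me no clue about an alternative prefix to use in the xpath (which might be obvious if I knew HTML).
A friend pointed out that parts of the document are generated with JavaScript (AJAX), which may be part of the problem. That said, the content I want does show up in the HTML, but it sits inside a node called "div.ajax-block":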
css selectors: #main > div > div.ajax-block > div > table > tbody > tr:nth-child(1) > td:nth-child(4)
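Can anyone help? Thanks.
It is very disturbing that most (if not all) SO R contributors default to "use a heavyweight third-party dependency" in curt "answers" whenever scraping comes up. 99% of the time you don't need Selenium. You just need to exercise the little grey cells.
First, one big clue that the page loads content asynchronously is the wait spinners that appear. The second is in your own snippet: the div selector actually has "ajax" as part of its name, a telltale sign that XHR requests are in play.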
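A minimal sketch of why the XPath in the question returns `character(0)`. The HTML string below is a hypothetical stand-in for what the server sends before any JavaScript runs (the real markup is an assumption here): the heading is present, but the `ajax-block` container is still empty, so there is no table for rvest to find.

```r
library(rvest)

# Hypothetical static HTML: roughly what read_html() might receive before
# the browser's JavaScript fills in the ajax-block container.
static_page <- read_html(
  '<div id="main"><div><h1>Details for CRC000002.1</h1></div>
   <div class="ajax-block"></div></div>'
)

# The heading is part of the static document, so it is found.
heading <- html_text(html_nodes(static_page, css = "#main h1"))

# The table only exists once the XHR response has been injected,
# so this node set is empty.
cells <- html_nodes(static_page, xpath = '//*[@id="main"]//table//td')

heading
length(cells)
```

An empty node set is what turns into `character(0)` once `html_text()` is applied to it.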
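If you open Developer Tools in your browser, reload the page, go to Network and click the XHR tab, you'll see the requests the page makes in the background.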
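Most of the "real" data on the page is loaded dynamically. We can write httr calls that mimic the browser's calls. However, we first need to make one GET call to the main page to prime some cookies, and then find a per-session generated token that is used to prevent abuse of the site. It is defined in JavaScript, so we'll use the V8 package to evaluate it. We could have used a regular expression to find the string instead; do whichever you prefer.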
library(httr)
library(rvest)
library(dplyr)
library(V8)
ctx <- v8() # we need this to eval some javascript
# Prime Cookies -----------------------------------------------------------
res <- httr::GET("https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1")
httr::cookies(res)
## domain flag path secure expiration name
## 1 .ecan.govt.nz TRUE / FALSE 2019-11-24 11:46:13 visid_incap_927063
## 2 .ecan.govt.nz TRUE / FALSE <NA> incap_ses_148_927063
## value
## 1 +p8XAM6uReGmEnVIdnaxoxWL+VsAAAAAQUIPAAAAAABjdOjQDbXt7PG3tpBpELha
## 2 nXJSYz8zbCRj8tGhzNANAhaL+VsAAAAA7JyOH7Gu4qeIb6KKk/iSYQ==
pg <- httr::content(res)
html_node(pg, xpath=".//script[contains(., '_monsido')]") %>%
html_text() %>%
ctx$eval()
## [1] "2"
monsido_token <- ctx$get("_monsido")[1,2]
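The regular-expression route mentioned above could look roughly like this. It is only a sketch: `script_txt` is a hypothetical example of the script body, assuming the token is pushed as `['_setDomainToken', '<token>']`, and `extract_token()` is an illustrative helper name. Check the real script text before relying on the pattern.

```r
# Hypothetical script text; the real page's script may be shaped differently.
script_txt <- paste0(
  "var _monsido = _monsido || []; ",
  "_monsido.push(['_setDomainToken', 'abc123']); ",
  "_monsido.push(['_withStatistics', 'true']);"
)

# Illustrative helper: capture the quoted value that follows _setDomainToken.
extract_token <- function(js) {
  pattern <- "_setDomainToken'[[:space:]]*,[[:space:]]*'([^']+)'"
  m <- regmatches(js, regexec(pattern, js))[[1]]
  # m[1] is the full match, m[2] the captured token; NA when nothing matched.
  if (length(m) < 2) NA_character_ else m[2]
}

token_from_regex <- extract_token(script_txt)
token_from_regex
```

The V8 approach evaluates the script as the browser would, so it keeps working if the quoting or spacing changes; the regex avoids the extra dependency but is tied to one exact layout.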
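# POST to the document-library search for this consent (nothing comes back)
httr::VERB(
verb = "POST", url = "https://www.ecan.govt.nz/data/document-library/searchlist",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
`X-Requested-With` = "XMLHttpRequest",
TE = "Trailers"
), httr::set_cookies(
monsido = monsido_token
),
body = list(
name = "CRC000002.1",
pageSize = "999999"
),
encode = "form"
) -> res
httr::content(res)
## NULL ## <<=== this is OK as there is no response
Here's the "Consent Overview" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentoverview/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
html_table() %>%
glimpse()
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ X1: chr [1:5] "RMA Authorisation Number" "Consent Location" "To" "Commencement Date" ...
## ..$ X2: chr [1:5] "CRC000002.1" "Manuka Creek, KILLERMONT STATION" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X3: chr [1:5] "Client Name" "State" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X4: chr [1:5] "Killermont Station Limited" "Issued - Active" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...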
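The overview response parses into a four-column table in which labels and values sit side by side (X1/X2 and X3/X4), and cells that span the row are repeated across columns. A possible way to turn that into a named lookup is sketched below. The `overview` data frame is a hand-copied subset of the values from this consent's overview table, and dropping pairs where label and value are identical is an assumption about how spanning cells show up.

```r
# Hand-copied subset of the overview table (not fetched live).
overview <- data.frame(
  X1 = c("RMA Authorisation Number", "Consent Location", "Commencement Date"),
  X2 = c("CRC000002.1", "Manuka Creek, KILLERMONT STATION", "29 Apr 2010"),
  X3 = c("Client Name", "State", "29 Apr 2010"),
  X4 = c("Killermont Station Limited", "Issued - Active", "29 Apr 2010"),
  stringsAsFactors = FALSE
)

# Stack the two label/value column pairs into one long table.
pairs <- rbind(
  data.frame(field = overview$X1, value = overview$X2, stringsAsFactors = FALSE),
  data.frame(field = overview$X3, value = overview$X4, stringsAsFactors = FALSE)
)

# Spanning cells repeat the same text in both columns; drop those pseudo-pairs.
pairs <- pairs[pairs$field != pairs$value, ]

# Named character vector: look values up by label.
fields <- setNames(pairs$value, pairs$field)
fields[["Client Name"]]
```

With the real response, `overview` would be the first element of the list returned by `html_table()`.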
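Here are the "Consent Conditions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentconditions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="consentDetails">
## <ul class="unstyled-list">
## <li>
##
##
## <strong class="pull-left">1</strong> <div class="pad-left1">The rate at which wa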
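Here's "Consent Related":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentrelated/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body>
## <p>There are no related documents.</p>
##
##
##
##
##
## <div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead><tr>
## <th>Relationship</th>
## <th>Recor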
Here's the "Workflow":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentworkflow/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res)
## {xml_document}
## <html>
## [1] <body><p>No workflow</p></body>
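Here are the "Consent Flow Restrictions":
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentflowrestrictions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead>
## <th colspan="2">Low Flow Site</th>
## <th>Todays Flow <span class="lower">(m3/s)</span>
## </th>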
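Each response is an HTML fragment, so rvest can pick it apart once it has been fetched. The sketch below parses a hypothetical fragment shaped like the start of the conditions response (`consentDetails`, `unstyled-list`, `pull-left`, `pad-left1`). The condition texts are placeholders, and whether every list item in the real response follows this shape is an assumption.

```r
library(rvest)

# Hypothetical fragment modelled on the first lines of the conditions
# response; the condition texts are placeholders, not real consent data.
frag <- read_html(
  '<div class="consentDetails"><ul class="unstyled-list">
     <li><strong class="pull-left">1</strong>
         <div class="pad-left1">First placeholder condition.</div></li>
     <li><strong class="pull-left">2</strong>
         <div class="pad-left1">Second placeholder condition.</div></li>
   </ul></div>'
)

# One <li> per condition; html_node() returns one match per list item.
items <- html_nodes(frag, "div.consentDetails li")

conditions <- data.frame(
  number = html_text(html_node(items, "strong"), trim = TRUE),
  text   = html_text(html_node(items, "div.pad-left1"), trim = TRUE),
  stringsAsFactors = FALSE
)
conditions
```

With a live response, `frag` would simply be `httr::content(res)`.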
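You'll still need to parse the HTML, but now you can do it all with plain R packages.
First, is it legal for you to take the data from that page?
Yes, it's all public information.
It's a dynamic page; use Selenium.
How would you extract the data with RSelenium? I had a quick look and it seems very complicated!
See my answer. This "use Selenium" madness is just that: madness.
Thanks, that's awesome! It works perfectly, and now I get to start banging my head against the wall with text pattern matching. Could you briefly explain how you chose the arguments for GET? They work fine in this case, but I don't think I could replicate it, and the help files in R are a bit opaque.
If this isn't time-sensitive, let me put it into a GitHub repo tomorrow. I'll drop a link here and we can work through it in GitHub issues.
Cool. If you could file your initial comment on this answer as an issue there, I'll do a write-up tomorrow. Tonight is the last night my college-freshman son is home for Thanksgiving, so I'll be able to start with gusto tomorrow.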
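On the question of where the GET arguments come from: as far as the answer shows, they appear to mirror what the browser sends for each XHR request listed in the Network tab (the endpoint URL, a `Referer` pointing at the details page, the `X-Requested-With` header, and the token as a cookie). Because only the endpoint name changes between calls, the requests can be wrapped in a small helper. This is a sketch: `consent_url()` and `fetch_consent_section()` are hypothetical names, httr must be installed, and the cookie-priming GET still has to run first in the same R session.

```r
# Build the URL for one section of one consent.
consent_url <- function(section, consent_id) {
  sprintf(
    "https://www.ecan.govt.nz/data/consent-search/%s/%s",
    section, consent_id
  )
}

# Hypothetical wrapper around the repeated GET calls; `token` is the
# value obtained from the page's _monsido script.
fetch_consent_section <- function(section, consent_id, token) {
  httr::GET(
    url = consent_url(section, consent_id),
    httr::add_headers(
      Referer = consent_url("consentdetails", consent_id),
      Authority = "www.ecan.govt.nz",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    httr::set_cookies(monsido = token)
  )
}

overview_url <- consent_url("consentoverview", "CRC000002.1")
overview_url
```

A call such as `fetch_consent_section("consentconditions", "CRC000002.1", monsido_token)` would then stand in for each of the hand-written requests.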