Scrape data from a page that uses a .JSF search in R

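I am trying to scrape information from the Swiss administrative court for university research.

The URL is:

I am interested in the data listed in the table that appears after a search has been run.

Unfortunately, there is no robots.txt file. However, all decisions on this site are open to the public.

I have some experience with HTML scraping and have consulted the following resources:

My approach

I thought that using PhantomJS to download an HTML version of the page and rvest to scrape the downloaded page would be a good approach.

My problem

However, I do not know how to get the URL of the page that shows 57,294 results when an "empty" search is performed (by clicking "suchen" without entering anything in the search mask). I was thinking of something like this: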

GET(url = "https://jurispub.admin.ch/publiws/",
      query=list(searchQuery="")) 
However, this does not work.

Furthermore, I do not know how to get PhantomJS to "click" the little arrow button to load the next page.

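Adding external dependencies is fine, but it should be a last resort (IMO).

If you are not familiar with the Developer Tools view in your browser, do some research on it before working through this answer. You need to have it open in a fresh browser session before you go to the search page to really see the flow.

`GET` does not work because this is an HTML form, and form elements use `POST` requests (shown as `XHR` requests in the Network pane of most Developer Tools). However, this is a poorly crafted site that is too complex for its own good (almost worse than a Microsoft SharePoint site). There is some initial state setup when you go to the start-search page, and that state is maintained throughout the rest of the process.

I used curlconverter to triage the `POST` `XHR` requests. TL;DR: to do this, right-click on any `POST` `XHR` request, find the "Copy as cURL" menu item, and select it. Then, with that still on the clipboard, follow the instructions in the curlconverter README and manual pages to get back actual `httr` functions. I really can't promise to walk you through that part or to answer `curlconverter` questions here.

Anyway, to get `httr`/`curl` to maintain some cookies for you, and to obtain a key session variable that you need to pass in on every call, we need to start with a fresh R session and "prime" the scraping process with a `GET` to the main search URL: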

library(stringi) # I prefer this for extracting matched strings
library(xml2)    # xml_find_first() and xml_text() are used below
library(rvest)
library(httr)

primer <- httr::GET("https://jurispub.admin.ch/publiws/pub/search.jsf")
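
The `POST` calls further down pass an `ice_session` value that has to come from this primed page. As a minimal sketch of one way to pull it out, assuming the page's inline JavaScript contains an assignment of the form `session: '...'` (that pattern and the helper name `extract_ice_session` are assumptions, so check them against the real page source):

```r
library(stringi)

# Hypothetical helper: pull the ICEfaces session id out of the page text.
# ASSUMPTION: the id appears in inline JavaScript as  session: '<id>'
extract_ice_session <- function(page_text) {
  m <- stri_match_first_regex(page_text, "session:\\s*'([^']+)'")
  m[, 2]  # first capture group; NA when the pattern is absent
}

# With the primed response from above this would be (not run here):
# ice_session <- extract_ice_session(httr::content(primer, as = "text"))
```

If the helper returns `NA`, the pattern does not match the page and needs adjusting before any of the `POST` requests can work.
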
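Now we pretend that we are submitting the form. All of these hidden variables may not be necessary, but this is what the browser sends. I usually try to pare them down to only what is needed, but this is your project, so have fun doing that if you want: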

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id64",
    ice.event.captured = "form:_id63first",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "51", 
    ice.event.y = "336",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "form", 
    icefacesCssUpdates = "",
    `form:_id63` = "first",
    `form:_idcl` = "form:_id63first",
    ice.session = ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63first",
    rand = "0.38654987905551663\\n\\n"
  ),
  encode = "form"
) -> first_pg
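Now that we have the first page, we need its data. I am not going to solve this completely, but you should be able to extrapolate from what follows. The `POST` request returns XML that the JavaScript on the page turns into a hideous-looking table. We extract that table: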

httr::content(first_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")
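Lather, rinse, repeat for any other columns. You may need to do some work to get them into good shape, and that is an exercise left to you (i.e. I won't answer questions about it).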
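
As a rough sketch of how the column extraction could be wrapped up, the helper below builds one data-frame row per decision. The name `extract_decisions`, the `base_url` default, and the small demo table (including its "Dokument" header) are made up for illustration. It assumes column 1 holds the dossier link and column 2 the download link, and that the result table contains no nested tables:

```r
library(rvest)

# Hypothetical helper: one row per decision with dossier number and PDF URL.
extract_decisions <- function(data_tbl, base_url = "https://jurispub.admin.ch") {
  rows <- html_nodes(data_tbl, xpath = ".//tr[td]")  # rows that contain data cells
  id   <- html_text(html_node(rows, xpath = "./td[1]/a"))
  href <- html_attr(html_node(rows, xpath = "./td[2]/a"), "href")
  keep <- !is.na(id)                                 # drop rows without a dossier link
  data.frame(
    dossier = id[keep],
    pdf_url = ifelse(is.na(href[keep]), NA_character_, paste0(base_url, href[keep])),
    stringsAsFactors = FALSE
  )
}

# Made-up sample table, only to show the shape of the result
demo <- read_html('<table>
  <tr><th>Dossiernummer</th><th>Dokument</th></tr>
  <tr><td><a href="#">A-1/2013</a></td>
      <td><a href="/publiws/download?decisionId=abc">PDF</a></td></tr>
  <tr><td><a href="#">B-2/2014</a></td><td></td></tr>
</table>')
decisions <- extract_decisions(html_node(demo, "table"))
```

Working row by row keeps the two columns aligned even when a row has no download link, which two independent `html_nodes()` calls would not guarantee.
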

You will also want to know where you are in the scraping process, so we need to grab the line at the bottom of the table:

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
Parsing that into the number of results and the page you are on is an exercise left to the reader.
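
For what it's worth, here is one possible sketch of that parsing step. The helper name `parse_pagination` is made up, and the regular expression assumes the German wording and comma thousands separators shown in the output above:

```r
library(stringi)

# Hypothetical helper: turn the pagination line into named integers.
parse_pagination <- function(txt) {
  m <- stri_match_first_regex(
    txt,
    "([\\d,]+) Entscheide gefunden, zeige ([\\d,]+) bis ([\\d,]+)\\. Seite ([\\d,]+) von ([\\d,]+)"
  )
  # Drop the thousands separators before converting to integers
  nums <- as.integer(stri_replace_all_fixed(m[1, -1], ",", ""))
  setNames(as.list(nums), c("total", "from", "to", "page", "pages"))
}

info <- parse_pagination(
  "57,294 Entscheide gefunden, zeige 1 bis 10. Seite 1 von 5,730. Resultat sortiert nach: Relevanz"
)
info$total  # 57294
info$pages  # 5730
```

Comparing `info$page` with `info$pages` gives a natural stopping condition for a paging loop.
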

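Now we need to programmatically click "next page" until we are done. I am going to do two manual iterations to prove that it works and to head off "it doesn't work" comments. You should write an iterator or loop to go through all the remaining pages and save the data however you like.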

For reference, applying the extraction to the first page's `data_tbl` gives the dossier numbers and the PDF links:

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
## [1] "A-3930/2013" "D-7885/2009" "E-5869/2012" "C-651/2011"  "F-2439/2017" "D-7416/2009"
## [7] "D-838/2011"  "C-859/2011"  "E-1927/2017" "E-2606/2011"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=0002b1f8-ea53-40bb-8e38-402d9f3fdfa9"
##  [2] "/publiws/download?decisionId=0002da8f-306e-4395-8eed-0b168df8634b"
##  [3] "/publiws/download?decisionId=0003ec45-50be-45b2-8a56-5c0d866c2603"
##  [4] "/publiws/download?decisionId=000508c2-c852-4aef-bc32-3385ddbbe88a"
##  [5] "/publiws/download?decisionId=0006fbb9-228a-4bdc-ac8c-52db67df3b34"
##  [6] "/publiws/download?decisionId=0008a971-6795-434d-90d4-7aeb1961606b"
##  [7] "/publiws/download?decisionId=00099619-519c-4c8f-9cea-a16ed9ab9fd8"
##  [8] "/publiws/download?decisionId=0009ac38-f2b0-4733-b379-05682473b5d9"
##  [9] "/publiws/download?decisionId=000a4e0f-b2a2-483b-a49f-6ad12f4b7849"
## [10] "/publiws/download?decisionId=000be307-37b1-4d46-b651-223ceec9e533"

Next page (first iteration):

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-4059/2011" "D-4389/2006" "E-4019/2006" "D-4291/2008" "E-5642/2012" "E-7752/2010"
##  [7] "D-7010/2014" "D-1551/2013" "C-7715/2010" "E-3187/2013"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=000bfd02-4da5-4bb2-a5d0-e9977bf8e464"
##  [2] "/publiws/download?decisionId=000e2be1-6da8-47ff-b707-4a3537320a82"
##  [3] "/publiws/download?decisionId=000fa961-ecb4-47d2-8ca3-72e8824c2c6b"
##  [4] "/publiws/download?decisionId=0010a089-4f19-433e-b106-6d75833fae9a"
##  [5] "/publiws/download?decisionId=00111bfc-3522-4a32-9e7a-fa2d9f171427"
##  [6] "/publiws/download?decisionId=00126b65-b345-4988-826b-b213080caa45"
##  [7] "/publiws/download?decisionId=00127944-5c88-43f6-9ef1-3c822288b0c7"
##  [8] "/publiws/download?decisionId=00135a17-f1eb-4b61-9171-ac1d27fd3910"
##  [9] "/publiws/download?decisionId=0014c6ea-c229-4129-bbe0-7411d34d9743"
## [10] "/publiws/download?decisionId=00167998-54d2-40a5-b02b-0c4546ac4760"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 11 bis 20. Seite 2 von 5,730. Resultat sortiert nach: Relevanz"

Next page (second iteration):

httr::POST(
  url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
  body = list(
    `$ice.submit.partial` = "true",
    ice.event.target = "form:_id67",
    ice.event.captured = "form:_id63next",
    ice.event.type = "onclick",
    ice.event.alt = "false",
    ice.event.ctrl = "false",
    ice.event.shift = "false",
    ice.event.meta = "false",
    ice.event.x = "330", 
    ice.event.y = "559",
    ice.event.left = "true",
    ice.event.right = "false",
    form = "", 
    icefacesCssUpdates = "",
    `form:_id63` = "next",
    `form:_idcl` = "form:_id63next",
    iceTooltipInfo = "tooltip_id=form:resultTable:7:tt_ps; tooltip_src_id=form:resultTable:7:_id57; tooltip_state=hide; tooltip_x=846; tooltip_y=433; cntxValue=",
    ice.session =  ice_session,
    ice.view = "1", 
    ice.focus = "form:_id63next",
    rand = "0.17641832791084566\\n\\n"
  ),
  encode = "form"
) -> next_pg

httr::content(next_pg) %>% 
  xml_find_first("//updates/update/content") %>% 
  xml_text() %>% 
  read_html() -> pg_tbl

data_tbl <- html_node(pg_tbl, xpath=".//table[contains(., 'Dossiernummer')]")

html_nodes(data_tbl, xpath=".//td[1]/a") %>% 
  html_text()
##  [1] "D-3974/2010" "D-5847/2009" "D-4241/2015" "E-3043/2010" "D-602/2016"  "C-2065/2008"
##  [7] "D-2753/2007" "E-2446/2010" "C-1124/2015" "B-7400/2006"

html_nodes(data_tbl, xpath=".//td[2]/a") %>% 
  html_attr("href")
##  [1] "/publiws/download?decisionId=00173ef1-2900-49d4-b7d3-39246e552a70"
##  [2] "/publiws/download?decisionId=001a344c-86b7-4f32-97f7-94d30669a583"
##  [3] "/publiws/download?decisionId=001ae810-300d-4291-8fd0-35de720a6678"
##  [4] "/publiws/download?decisionId=001c2025-57dd-4bc6-8bd6-eedbd719a6e3"
##  [5] "/publiws/download?decisionId=001c44ba-e605-455d-9609-ed7dffb17adc"
##  [6] "/publiws/download?decisionId=001c6040-4b81-4137-a6ee-bad5a5019e71"
##  [7] "/publiws/download?decisionId=001d0811-a5c2-4856-aef3-51a44f7f2b0e"
##  [8] "/publiws/download?decisionId=001dbf61-b1b8-468d-936e-30b174a8bec9"
##  [9] "/publiws/download?decisionId=001ea85a-0765-4a1f-9b81-3cecb9f36b31"
## [10] "/publiws/download?decisionId=001f2e34-9718-4ef7-a60c-e6bbe208003b"

html_node(pg_tbl, xpath=".//span[contains(@class, 'iceOutFrmt')]") %>% 
  html_text()
## [1] "57,294 Entscheide gefunden, zeige 21 bis 30. Seite 3 von 5,730. Resultat sortiert nach: Relevanz"
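
The two manual iterations can be folded into a loop. The sketch below is only an outline under stated assumptions. `fetch_next` and `collect_pages` are made-up names, `fetch_next` expects the same `body` list used in the "next page" requests, and the one-second delay is an arbitrary courtesy pause. `collect_pages` takes the fetching step as a function, so the loop logic can be tried offline with a stub:

```r
library(rvest)

# Hypothetical fetcher: POST the "next page" form and return the parsed table HTML.
fetch_next <- function(body) {
  res <- httr::POST(
    url = "https://jurispub.admin.ch/publiws/block/send-receive-updates",
    body = body,
    encode = "form"
  )
  httr::stop_for_status(res)
  content_node <- xml2::xml_find_first(httr::content(res), "//updates/update/content")
  read_html(xml2::xml_text(content_node))
}

# Walk `n_pages` pages and collect the dossier numbers from each one.
# `fetch` is any zero-argument function that returns a parsed page.
collect_pages <- function(fetch, n_pages, delay = 1) {
  out <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    pg_tbl   <- fetch()
    data_tbl <- html_node(pg_tbl, xpath = ".//table[contains(., 'Dossiernummer')]")
    out[[i]] <- html_text(html_nodes(data_tbl, xpath = ".//td[1]/a"))
    Sys.sleep(delay)  # pause between requests to go easy on the server
  }
  unlist(out)
}

# Usage (not run); `next_body` stands for the body list of the "next page" request:
# ids <- collect_pages(function() fetch_next(next_body), n_pages = 3)
```

For a full run you would set `n_pages` from the pagination line and write each page's results to disk as you go, so that a dropped session does not cost you everything collected so far.
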
If you would rather drive a real browser, the same steps can be done with Selenium. Start the server and open a browser session:

library(wdman) # for managing the Selenium server d/l
library(RSelenium) # for getting a connection to the Selenium server
library(seleniumPipes) # for better navigation & scraping idioms
library(rvest) # for html_node() / html_nodes() on the page source

selServ <- selenium() # start the Selenium server
selServ$log()$stderr # check the server log for startup problems

sel <- remoteDr(browserName = "chrome", port = 4567) # port must match the running server

Open the search page and run the empty search:

sel %>% 
  go("https://jurispub.admin.ch/publiws/pub/search.jsf")

sel %>% 
  findElement("name", "form:searchSubmitButton") %>%  # find the submit button 
  elementClick() # click it

Pull the result table out of the rendered page:

sel %>% 
  getPageSource() %>% # like read_html()
  html_node("table.iceDatTbl") -> dtbl  # this is the data table

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol1']/a") %>% # get doc ids
  html_text()

html_nodes(dtbl, xpath=".//td[@class='iceDatTblCol2']/a[contains(@href, 'publiws')]") %>% 
  html_attr("href") # get pdf links

sel %>% 
  getPageSource() %>% 
  html_node("span.iceOutFrmt") %>% 
  html_text() # the total items / pagination info

Click through to the next page, and stop the server when you are done:

sel %>%
  findElement("xpath", ".//img[contains(@src, 'arrow-next')]/../../a") %>% 
  elementClick() # go to next page

selServ$stop() # shut down the Selenium server