How do I scrape information that only appears after a click on a web page, using R?


r web-scraping


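I am trying to scrape a phone number from a listing page on olx.pl (the listing whose URL appears in the code below). The phone number can be scraped using the rvest package with a selector ending in

:nth-child(1) span+div strong

(suggested by [selectorGadget]).

The problem is that the information only becomes available after clicking its mask. So I have to open a session, perform a click, and then scrape the information.

EDIT: By the way, it is not a link; look at the source code. This is a problem for me because I am a regular R user, not a JavaScript programmer.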

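Here is a solution using RSelenium (with a Selenium server) and phantomjs.

However, I am not sure how useful it is, because it runs very slowly on my machine. I am not a phantomjs or selenium expert, so I don't yet know where the speed could be improved; that needs some research.

Edit

I tried again and the speed seems OK.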

library(RSelenium)
library(rvest)

## Terminal command to start selenium (on ubuntu)
## cd ~/selenium && java -jar selenium-server-standalone-2.48.2.jar
url <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

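## Note: startServer() comes from older RSelenium releases; later versions made it defunct in favour of rsDriver()
RSelenium::startServer()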
remDr <- remoteDriver(browserName = "phantomjs")

remDr$open()
remDr$navigate(url)

# css <- ".cpointer:nth-child(1)"  ## couldn't get this to work
xp <- "//div[@class='contactbox-indent rel brkword']"
webElem <- remDr$findElement(using = 'xpath', xp)

# webElem <- remDr$findElement(using = 'css selector', css)
webElem$clickElement()

## the page source now includes the clicked element
page_source <- remDr$getPageSource()[[1]]
pos <- regexpr('class=\\"xx-large', page_source)

## you could write a more intelligent regex, but this works for now
phone_number <- substr(page_source, pos + 11, pos + 21)
phone_number
# "503 155 744"

# remDr$close()
# remDr$closeServer()
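
The fixed character offsets above break as soon as the markup shifts by a character. A minimal base-R sketch of a slightly more robust extraction follows. The HTML string is a made-up stand-in for `page_source` (the real markup may differ), and `extract_phone` is a hypothetical helper name, not part of RSelenium or rvest.

```r
## Hypothetical stand-in for remDr$getPageSource()[[1]]; the real page markup may differ
page_source <- '<div class="contactbox-indent rel brkword"><strong class="xx-large">503 155 744</strong></div>'

## Hypothetical helper: return the text that follows the first tag carrying class "xx-large"
extract_phone <- function(html) {
  hit <- regmatches(html, regexpr('class="xx-large[^>]*>[^<]+', html))
  if (length(hit) == 0) return(NA_character_)  # no such tag in the source
  trimws(sub("^[^>]*>", "", hit))              # drop everything up to the closing '>'
}

phone_number <- extract_phone(page_source)
phone_number
```

Matching on the class name rather than on a character position means the extraction still works if the number changes length, and it returns `NA` instead of an arbitrary substring when the element is missing.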

You can grab the data embedded in the <li> tags, which tells the onclick handler what to do, and then fetch the data directly:
library(httr)
library(rvest)
library(purrr)
library(stringr)

URL <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

pg <- read_html(URL)

html_nodes(pg, "li.rel") %>%       # get the 'special' <li> tags
  html_attrs() %>%                 # extract all the attrs (they're non-standard)
  flatten_chr() %>%                # list to character vector
  keep(~grepl("rel \\{", .x)) %>%  # only want ones with 'hidden' secret data
  str_extract("(\\{.*\\})") %>%    # only get the data
  unique() %>%                     # there are duplicates
  map_df(function(x) {

    path <- str_match(x, "'path':'([[:alnum:]]+)'")[,2]                  # extract out the path
    id <- str_match(x, "'id':'([[:alnum:]]+)'")[,2]                      # extract out the id

    ajax <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id)  # make the AJAX/XHR URL
    value <- content(GET(ajax))$value                                    # get the data

    data.frame(path=path, id=id, value=value, stringsAsFactors=FALSE)    # make a data frame

  }) 

## Source: local data frame [3 x 3]
## 
##           path    id       value
##          (chr) (chr)       (chr)
## 1        phone dX6wf 503 155 744
## 2        skype dX6wf    e.bobruk
## 3 communicator dX6wf     7686136
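
To make the URL-building step concrete, here is a small offline sketch in base R. The attribute string is an assumed example of what the class attribute might look like (inferred from the regular expressions above, not copied from the page), and `get_field` and `build_ajax_url` are hypothetical helpers. No request is sent.

```r
## Assumed example of a class attribute carrying the embedded data (not copied from the live page)
attr_value <- "link-phone rel {'path':'phone', 'id':'dX6wf'}"

## Hypothetical helper: pull one quoted field out of the embedded data
get_field <- function(x, key) {
  pattern <- sprintf("'%s':'([[:alnum:]]+)'", key)
  regmatches(x, regexec(pattern, x))[[1]][2]  # element 2 is the captured group
}

## Hypothetical helper: build the AJAX URL the same way the sprintf() call above does
build_ajax_url <- function(x) {
  sprintf("http://olx.pl/ajax/misc/contact/%s/%s/",
          get_field(x, "path"), get_field(x, "id"))
}

ajax_url <- build_ajax_url(attr_value)
ajax_url
```

Keeping the parsing separate from the HTTP call makes it easy to check the constructed URLs before sending any requests to the site.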

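Have you tried RSelenium?

It is nice to have a solution that does not require external software or programs. Have you come across situations where you needed something like selenium, or can you generally do everything in R?

I try not to use it, since in my opinion the idioms presented in the RSelenium package need a "Hadleyverse" makeover.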