Scraping a web page with phantomjs and rvest

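I am trying to scrape the following web page: https://www.occ.com.mx/empleos-en-nuevo-leon using rvest and SelectorGadget, which looked like it would be simple.

However, the page appears to be built with JavaScript, so I followed a tutorial, installed phantomjs, and tried to build the HTML page locally with the script below: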

// scrape_occ.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'occ.html';

page.open('https://www.occ.com.mx/empleos-en-nuevo-leon', function (status) {
  // Save whatever HTML the page holds when the load callback fires
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
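I then call it from R:
system("phantomjs scrape_occ.js")
This builds the HTML locally, but the result is similar to my original rvest attempt. I get more nodes, but a form seems to be blocking my way: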

p <- read_html("occ.html")
p %>% html_structure()
<html>
  <head>
    {text}
    <meta [http-equiv, content]>
    {text}
    <meta [http-equiv, content]>
    {text}
    <meta [http-equiv, content]>
    {text}
    <noscript>
      {text}
    {text}
  <body>
    <apm_do_not_touch>
      {text}
      <script [language]>
        {cdata}
      {text}
    {text}
    <form [action, method]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
      <input [type, name, value]>
What is the right way to do this? What am I missing?

RSelenium (my last resort, given that it scripts a full browser) can help:

library(RSelenium)
library(rvest)
library(dplyr)

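checkForServer()   # download the Selenium server if needed (defunct in newer RSelenium; see rsDriver())
startServer()      # launch the Selenium server
remDr <- remoteDriver()
remDr$open()
remDr$navigate("https://www.occ.com.mx/empleos-en-nuevo-leon")
pg <- remDr$getPageSource()   # rendered HTML, returned as a one-element list
remDr$close()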

doc <- read_html(pg[[1]])

trimws(html_text(html_nodes(doc, "div.lineamod_sr")))

##  [1] "Feb 16  \n                Coordinador de Operaciones (Zona Cienega de Flores)"      
##  [2] "Feb 16  \n                Coordinador de Operaciones (Zona Apodaca)"                
##  [3] "Feb 16  \n                Ventas de Tecnologias de informacion (zona Monterrey)"    
##  [4] "Feb 16  \n                Practicante en Capacitación"                              
##  [5] "Feb 16  \n                Home Office: Reclutador Bilingüe"                         
##  [6] "Feb 16  \n                SISTEMAS"                                                 
##  [7] "Feb 16  \n                Asociado de RH"                                           
##  [8] "Feb 16  \n                Practicante Import Export"                                
##  [9] "Feb 16  \n                Comprador de refacciones"                                 
## [10] "Feb 16  \n                Maestra Auxiliar de Preescolar"                          
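
If the posting date and the job title are wanted as separate columns, the strings above could be split at the newline. This is only a minimal base R sketch, assuming every entry has the "date, newline, title" shape shown in the output; split_listing and raw are hypothetical names that are not part of the original answer:

```r
# Two sample strings copied from the output above
raw <- c(
  "Feb 16  \n                Coordinador de Operaciones (Zona Apodaca)",
  "Feb 16  \n                Asociado de RH"
)

# Split each entry at its first newline: the date comes before it, the title after
split_listing <- function(x) {
  parts <- regmatches(x, regexpr("\n", x, fixed = TRUE), invert = TRUE)
  data.frame(
    posted = trimws(vapply(parts, `[`, character(1), 1)),
    title  = trimws(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

jobs <- split_listing(raw)
print(jobs)
```

An entry without a newline ends up with NA in the title column, which makes unexpected markup easy to spot.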
And, it returns the same result.

This curlconverter thing actually produced the following:

httr::VERB(
  verb = "POST", 
  url = "https://www.occ.com.mx/empleos-en-nuevo-leon", 
  add_headers(Pragma = "no-cache", 
              Origin = "https://www.occ.com.mx", 
              `Accept-Encoding` = "gzip, deflate", 
              `Accept-Language` = "en-US,en;q=0.8", 
              `Upgrade-Insecure-Requests` = "1", 
              `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
              Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
              `Cache-Control` = "no-cache", 
              Referer = "https://www.occ.com.mx/empleos-en-nuevo-leon", 
              Connection = "keep-alive", 
              DNT = "1"), 
  set_cookies(cookie_home3 = "240784394.20480.0000", 
              TS01650164_28 = "011b64c78f24f98a905a1610d142f8f5e8e6de557a270318cd246d8bdad634ad9d692ef9a7e7c5f5ee6125dfa6d154a7ecc9912326", 
              q = "", cat = "", fn = "", 
              salary = "", tm = "60", 
              loc = "", page = "1", TS01650164_77 = "08cb11f491ab28002f9f692817ac18f609a6564679dd056ae514dd5ed8f04683e016c84db8ad4aff81249c3ffa6f4d72086a11c1a482380047cb7d1b1a33ef2b110ce9bf57566934a0ba894dd447b3cf48fdfb5b4e17eb905895ebb597a5e878746dd0fba0f4fbb336e2f855820d002a", 
              JobSeekerInfo = "5877fd484ea1405b8e4780c2c0b05365|", 
              searchtype = "1", loctemp = "MX-NL", 
              fntemp = "MX-NL", cattemp = "", 
              qtemp = "", salarytemp = "", 
              tmtemp = "", jobids = "8687636,8687651,8689022,8688254,8688719,8688496,8687786,8687676,8688424,8654486,8688354,8687735,8687291,8688736,8687917,8688614,8687428,8688586,8688659,8687483,8688163,8654471,8687688,8688932,8689017,8642398,8687594,8688554,8688940,8687254,8688285,8415967,8687500,8688224,8687619,8688640,8688757,8689026,8688796,8702349,8728708,8728866,8728860,8721116,8591064,8729151,8729124,8729136,8729485,8551881", 
              returnurl = "/empleos-en-nuevo-leon", 
              Culture = "es-MX", TS01650164 = "01b518c62e8b19f2612cfbd7378cc0b73784ab414f17fdcc566da5c243d4a31e6ef629c80498c411d543fa42cb92729a648fc738579c3333ce81ff4708684a7e9f18f32db913b5b26edaa3bb4a0ad4d5ee740b16ab26bce774e39df018e046ddcf1c6529ce4b2194e780f9f039d729db6d958e850ead9570fc4748aa65cd7f85c0b79744cbf998c055a8314887942f194106c71e4cc2f736e495d21f228e9ceabfd36dc194b5ee20d31a00bc2b6469c8645ed502656b60a0a9b38aaf38a3435488f6e942917750d71f27386a87f3ff25f3bd1d163c0e6bd60212d36a3056fbdd4e055dab8c", 
              sitemgrOCCM = "false", 
              sitemgrRUE = "false", loaded = "1"), 
  body = list(TS01650164_id = "3", 
              TS01650164_cr = "08cb11f491ab2800f5f07209cf7558bf5a68e875ec0c3a7ca8beb59ff0f5da214e8ca0f0b0d3925c3466838922d655d208203848fa894800ea2891c392679d0ca17d55a2a828b5e4bbe97f5e3310800fb216c3a9821974a3aa20fa64d8feec278da6e1000bf2a8ac7f2533a8b0e5f5424e20ea558b261d881d570b85067b4859", 
              TS01650164_76 = "0", TS01650164_86 = "0", 
              TS01650164_md = "1", TS01650164_rf = "0", 
              TS01650164_ct = "0", TS01650164_pd = "0"), 
  encode = "form") 

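The problem is that it makes a POST request with those time-dependent cookie settings. So it will work if it is run relatively soon after grabbing the "Copy as cURL" command from a Developer Tools session, but I am not sure how long it stays valid after that.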

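Thanks! That said, this still throws this error: (I am trying to debug it) :-)

That would be my second problem with Selenium: getting it to work consistently across platforms. I still can't get phantomjs working on El Capitan. I just updated the answer to show that plain ol' rvest seems to work. Does it not work for you?

I wasted several hours yesterday trying to get Selenium to work, to no avail. I'm not sure why my rvest doesn't work, but I'll try your approach later tonight (I can't test it right now). Thanks.

Hang on. I just realized what I had done before that. Will update.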