JavaScript: next page rendered by PhantomJS


javascript r phantomjs


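I want to crawl all the link hrefs from https://www.vietnamworks.com/job-search/all-jobs.

I realized the site renders its content with JavaScript, so I scraped it with PhantomJS from R, but I can only get page 1.

How can I click through to the next page and crawl all the remaining links?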

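Assuming the data itself is what you're after, there is another way to do this. If you right-click the page in Chrome, choose Inspect, and watch the Network tab, you can find the API calls the site makes to retrieve the data. Each call returns 50 results and the total appears to be capped at 5000, so when I tested, the largest page parameter the function accepted was around 96.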

.job_api <- function(page = 0){
  library(stringi)
  library(httr)
  # Algolia search endpoint that the site queries for its job listings
  url <- "https://jf8q26wwud-dsn.algolia.net/1/indexes/*/queries?"
  # They put request headers into their query string directly
  string_heads <- c(
    "x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%20(lite)%203.24.5%3Binstantsearch.js%201.6.0%3BJS%20Helper%202.21.2",
    "x-algolia-application-id=JF8Q26WWUD",
    "x-algolia-api-key=M2UzZmI1Zjc1NGMwZmYzZjJiNWE0ZTgxMzNjNmIzMjc2ODEyZWQwZTJmYzNjMDhjNmU3NGQ3ZGViMzJiZTlkNHRhZ0ZpbHRlcnM9JnVzZXJUb2tlbj00ODBiNjRhNzI2NjQ3ODgwMThmNDhjZWNkYmVhNGVlYg%3D%3D"
  )

  api_url <- stri_join(c(url, stri_join(string_heads, collapse = "&")), collapse = "")

  # form body data
  body_part <- '{"requests":[{"indexName":"vnw_job_v2","params":"query=&hitsPerPage=50&maxValuesPerFacet=20&page=0&restrictSearchableAttributes=%5B%22jobTitle%22%2C%22skills%22%2C%22company%22%5D&facets=%5B%22categoryIds%22%2C%22locationIds%22%2C%22categories%22%2C%22locations%22%2C%22skills%22%2C%22jobLevel%22%2C%22company%22%5D&tagFilters="}]}'
  # Substitute the requested page number into the body with a regex (ugly but quick);
  # the lookbehind is case-sensitive, so "hitsPerPage=50" is left untouched
  body_post <- stri_replace_all_regex(body_part, "(?<=page\\=)[0-9]+", page)

  # Make the api call
  call <- POST(api_url, body = body_post)
  # if pass... return data or else fail with the response information
  if(status_code(call) == 200L){
    content(call)
  }else {
    return(call)
  }

}

Here is what some of the output looks like:

> test <- .job_api(0)
> length(test$results[[1]]$hits)
[1] 50
> names(test$results[[1]]$hits[[50]]$`_highlightResult`)
[1] "jobTitle"       "skills"         "company"        "jobDescription" "jobRequirement"
> test$results[[1]]$hits[[5]]$`_highlightResult`$skills[[1]]
$value
[1] "Process System Engineering"

$matchLevel
[1] "none"

$matchedWords
list()
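
The output above shows that each hit carries a `_highlightResult` element whose `skills` entry is a list of items with a `value` field. As a minimal sketch, the skill names of one hit could be collected like this. `hit_skills` is a hypothetical helper name, and the code assumes every skills entry has a character `value`, as in the sample shown.

```r
# Hypothetical helper: collect the skill names from one hit.
# Assumes each entry under `_highlightResult`$skills has a character `value`,
# as in the sample output above.
hit_skills <- function(hit) {
  skills <- hit[["_highlightResult"]][["skills"]]
  if (is.null(skills)) return(character(0))
  vapply(skills, function(s) as.character(s[["value"]]), character(1))
}

# Usage with the `test` object from above:
# hit_skills(test$results[[1]]$hits[[5]])
```

Returning an empty character vector for hits without skills keeps the helper safe to use inside `lapply()` over all hits.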

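Get the button's ID and trigger its onclick.

Thank you, Mr. Carl Boneri. Your code works well. I wonder whether there is a limit on how much data the site lets users fetch through the API?

My guess is 5000 records per session or something like that. Remember that this is their internal API, so don't hammer it.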
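
To collect every page while going easy on the API, one option is a paced loop around the `.job_api()` function from the answer. This is only a sketch: `collect_hits` and its arguments are hypothetical names, the default of 100 pages is just 5000 results divided by 50 per page (the answer observed a lower real limit of about 96), and the one-second delay is an arbitrary choice.

```r
# Hypothetical helper: walk the pages one by one and gather all hits.
# `fetch` is any function that takes a zero-based page number and returns a
# parsed response shaped like the output of .job_api() above.
collect_hits <- function(fetch, max_pages = 100, delay = 1) {
  all_hits <- list()
  for (page in seq_len(max_pages) - 1) {
    res <- fetch(page)
    # .job_api() returns the raw httr response when the status is not 200
    if (inherits(res, "response")) break
    if (length(res$results) == 0) break
    hits <- res$results[[1]]$hits
    if (length(hits) == 0) break   # ran past the last page
    all_hits <- c(all_hits, hits)
    Sys.sleep(delay)               # pause between calls so the API is not hammered
  }
  all_hits
}

# Usage (makes real network calls, one per page):
# hits <- collect_hits(.job_api)
# length(hits)
```

Passing the fetch function as an argument means the loop can be tried with a stub before it is pointed at the live endpoint.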