Scraping a table with rvest

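I am trying to scrape the occupation code table from IPUMS (the OCC1950 page used in the code below).

I have (ugly) code that works (see below), but I tried to use rvest and failed. I would like to know whether there is a neater way.

Inspecting the page, I believe the table has the selector #dataTable>table>tbody, which is a selector form I have not seen before, and passing that selector to html_node() does not work: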

"https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  read_html() %>%
  html_node('#dataTable > table > tbody') %>%
  html_table()

Error in UseMethod("html_table") : 
  no applicable method for 'html_table' applied to an object of class "xml_missing"

Using only #dataTable as the selector fails as well:

"https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  read_html() %>%
  html_node('#dataTable') %>%
  html_table()

Error: html_name(x) == "table" is not TRUE
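
One way to read these two errors is to inspect what each selector actually matched before calling html_table(). The sketch below does this on a small inline page rather than the live site. The markup is a hypothetical stand-in that assumes the #dataTable container is empty in the raw HTML and that the rows are supplied as JSON in a script element; the real IPUMS source may be laid out differently.

```r
library(rvest)

# Hypothetical stand-in for the page source (an assumption, not the real markup):
# the container is empty and the rows live in a <script> tag as JSON.
page <- read_html('<html><body>
  <div id="dataTable"></div>
  <script>var codes = [{"code":"001","label":"Farmer (placeholder)","indent":false}];</script>
</body></html>')

# First attempt: nothing in the raw HTML matches this selector,
# so html_node() returns an object of class "xml_missing".
node <- html_node(page, "#dataTable > table > tbody")
is_missing <- inherits(node, "xml_missing")
print(is_missing)

# Second attempt: the selector matches, but the element is not a <table>,
# which is the condition the html_name(x) == "table" check complains about.
container <- html_node(page, "#dataTable")
print(html_name(container))
```

If the live page behaves like this stand-in, the first error means the selector matched nothing in the HTML that read_html() received, and the second means the matched element exists but is not a table element.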

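Slicing up the page source with RCurl, jsonlite and stringr, the following does work. Perhaps the fact that the table appears to be stored as JSON explains why my rvest attempts failed, but I am not sure.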

library(RCurl)
library(jsonlite)
library(magrittr)
library(stringr)

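# Download the raw page source as a single string
txt <- "https://usa.ipums.org/usa-action/variables/OCC1950#codes_section" %>%
  getURL()

# Keep the line that mentions "Farmer", cut out the JSON array on it,
# and parse that array into a data frame
ipums <-
  txt %>%
  str_extract(".*Farmer.*") %>%
  str_extract("\\[.*false\\}\\]") %>%
  fromJSON()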


I think this is one of those websites where rvest is not the best tool for the job.
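
If the table data really is embedded as JSON in the page source, as the working code in the question suggests, a slightly tidier variant is to let rvest find the script elements and leave only the JSON extraction to a regular expression. This is a minimal sketch under that assumption: the inline page, the variable name codes and the field names are hypothetical placeholders, and the pattern would need adjusting to the real source.

```r
library(rvest)
library(jsonlite)
library(stringr)

# Hypothetical stand-in for the page source; values are placeholders.
page <- read_html('<html><body>
  <div id="dataTable"></div>
  <script>var codes = [{"code":"001","label":"Farmer (placeholder)","indent":false},{"code":"002","label":"Example occupation","indent":false}];</script>
</body></html>')

# Collect the text of every <script> element
scripts <- page %>% html_nodes("script") %>% html_text()

# Pull out anything shaped like a JSON array of objects, then keep the first hit
json <- str_extract(scripts, "\\[\\{.*\\}\\]")
json <- json[!is.na(json)][1]

# Parse the array into a data frame, one row per object
codes <- fromJSON(json)
print(codes)
```

On the live site, the page object would come from read_html() on the URL instead of the inline string. html_table() is not involved at all, because in this scenario the raw HTML contains no table element to parse.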