
How to scrape a table from a website in R


I want to extract the bottom table ("Daily Observations") from the page loaded below. I got the full XPath of the table element, but it shows
{xml_nodeset(0)}
as output. What am I doing wrong? I used the following code:

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')  
single %>%
  html_nodes(xpath = '/html/body/app-root/app-history/one-column-layout/wu-header/sidenav/mat-sidenav-container/mat-sidenav-content/div/section/div[2]/div/div[5]/div/div/lib-city-history-observation/div/div[2]/table')

The table node set seems to be empty.
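As a quick check, a generic selector shows whether the static HTML contains any tables at all; on a client-rendered page it should come back empty as well (a sketch of that check, not part of the original question):

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')
single %>% html_nodes("table")  # expect {xml_nodeset (0)} if the table is built by JavaScript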

This is a dynamic page: the table is generated by JavaScript, so rvest alone is not enough. You can, however, still fetch the source content from the JSON API behind the page:

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read the static HTML. It won't contain the rendered table, but it does
# hold the API key we need to retrieve the source JSON.

htm_obj <- 
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a script node with JavaScript content.
str_apikey <- 
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>% gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", . )
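# Note: "&q;" is how double quotes are escaped inside the embedded JSON blob,
# which is why the regex matches them literally.
# Optional sanity check (a sketch; the 100-character bound is a rough guess,
# not a documented key length): if the regex failed to match, str_apikey
# would contain the whole script text instead of a short token.
stopifnot(nchar(str_apikey) > 0, nchar(str_apikey) < 100)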

# Create a URI pointing to the API, with the API key as the first key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)
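# Optional peek (sketch): inspect the top-level structure of the response;
# the hourly records sit under json_obj$observations, used below.
str(json_obj, max.level = 1)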

# Wrangle the JSON's contents into the table you need
tbl_daily <- 
  json_obj$observations %>% as_tibble() %>% 
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>% 
                          with_tz("America/New_York")) %>% # The time zone this airport (KDCA) is located in.
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust, pressure, precip_hrly) # The equivalent variables of your html table
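A minimal usage sketch (head() just previews the result; the file name is only an example):

head(tbl_daily)  # one row per hourly observation, timestamps in local time
readr::write_csv(tbl_daily, "kdca_2011-01-01.csv")  # readr is attached via tidyverse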
I'd suggest using the rnoaa package and pulling the data directly from the government NOAA site. Another option is to use your web browser's developer tools to find the JSON file that holds the requested data.

@Dave2e The rnoaa package doesn't give access to the recent (post-2010) hourly data that matters for my analysis, and I couldn't find a JSON file behind the wunderground page; there may not be one.

This is amazing! You saved me days of copying and pasting. Thank you so much.