
How to scrape a table from a website in R


I want to extract the bottom table ("Daily Observations") from the page loaded below. I got the full XPath of the table element, but it shows
{xml_nodeset(0)}
as output. What am I doing wrong? I used the following code:

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')  
single %>%
  html_nodes(xpath = '/html/body/app-root/app-history/one-column-layout/wu-header/sidenav/mat-sidenav-container/mat-sidenav-content/div/section/div[2]/div/div[5]/div/div/lib-city-history-observation/div/div[2]/table')

The table node set seems to be empty.
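As a quick check, a generic selector shows whether the static HTML contains any tables at all; on a client-rendered page it should come back empty as well (a sketch of that check, not part of the original question):

library(rvest)
single <- read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')
single %>% html_nodes("table")  # expect {xml_nodeset (0)} if the table is built by JavaScript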

This is a dynamic page: the table is generated by JavaScript, so rvest alone is not enough. You can, however, still fetch the source content from the JSON API behind the page:

library(tidyverse)
library(rvest)
library(lubridate)
library(jsonlite)

# Read the static HTML. It won't contain the rendered table, but it does
# hold the API key we need to retrieve the source JSON.

htm_obj <- 
  read_html('https://www.wunderground.com/history/daily/us/dc/washington/KDCA/date/2011-1-1')

# Retrieve the API key. This key is stored in a script node with JavaScript content.
str_apikey <- 
  html_node(htm_obj, xpath = '//script[@id="app-root-state"]') %>%
  html_text() %>% gsub("^.*SUN_API_KEY&q;:&q;|&q;.*$", "", . )
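# Note: "&q;" is how double quotes are escaped inside the embedded JSON blob,
# which is why the regex matches them literally.
# Optional sanity check (a sketch; the 100-character bound is a rough guess,
# not a documented key length): if the regex failed to match, str_apikey
# would contain the whole script text instead of a short token.
stopifnot(nchar(str_apikey) > 0, nchar(str_apikey) < 100)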

# Create a URI pointing to the API, with the API key as the first key-value pair of the query
url_apijson <- paste0(
  "https://api.weather.com/v1/location/KDCA:9:US/observations/historical.json?apiKey=",
  str_apikey,
  "&units=e&startDate=20110101&endDate=20110101")
# Capture the JSON
json_obj <- fromJSON(txt = url_apijson)
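# Optional peek (sketch): inspect the top-level structure of the response;
# the hourly records sit under json_obj$observations, used below.
str(json_obj, max.level = 1)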

# Wrangle the JSON's contents into the table you need
tbl_daily <- 
  json_obj$observations %>% as_tibble() %>% 
  mutate(valid_time_gmt = as_datetime(valid_time_gmt) %>% 
                          with_tz("America/New_York")) %>% # The time zone this airport (KDCA) is located in.
  select(valid_time_gmt, temp, dewPt, rh, wdir_cardinal, gust, pressure, precip_hrly) # The equivalent variables of your html table
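A minimal usage sketch (head() just previews the result; the file name is only an example):

head(tbl_daily)  # one row per hourly observation, timestamps in local time
readr::write_csv(tbl_daily, "kdca_2011-01-01.csv")  # readr is attached via tidyverse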
I'd suggest using the rnoaa package and pulling the data directly from the government NOAA site. Another option is to use your web browser's developer tools to find the JSON file that holds the requested data.

@Dave2e The rnoaa package doesn't give access to the recent (post-2010) hourly data that matters for my analysis, and I couldn't find a JSON file behind the wunderground page; there may not be one.

This is amazing! You saved me days of copying and pasting. Thank you so much.