Scraping an HTML table in R
I am trying to get the table at the URL below. The problem is that the table is not a plain HTML table, so html_table() does not work. So far I have tried extracting the table node, but it returns nothing:
library(rvest)
url = "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
webpage <- read_html(url)
table_html <- html_nodes(webpage, 'table#Tabc')
table <- html_table(table_html)
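The issue here is that the page is rendered via JavaScript, so rvest on its own will not work. One of the simplest ways around this is to use a headless web browser. We can use PhantomJS.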
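Before reaching for a headless browser, it can help to confirm that the table really is empty in the raw HTML. The sketch below runs on a made-up miniature page: the `static_html` string is an assumption for illustration, not the real CENACE markup, and only the variable names `vnctab` and `vdatrep` are taken from this thread. It counts the table rows present before any JavaScript runs and checks whether a script block assigns the data array.

```r
library(xml2)

# Made-up stand-in for the raw page (an assumption, not the real markup):
# an empty table shell plus a script that holds the data.
static_html <- paste0(
  '<html><body><table id="Tabc"></table>',
  '<script>var vnctab = ["Codigo", "Hora"]; var vdatrep = [["A1", 1]];</script>',
  '</body></html>'
)
page <- read_html(static_html)

# Rows present in the static markup, before any JavaScript has run
n_rows <- length(xml_find_all(page, "//table[@id='Tabc']//tr"))

# Script blocks that assign the data array
scripts <- xml_text(xml_find_all(page, "//script"))
has_data_var <- any(grepl("vdatrep\\s*=", scripts))

cat("rows in static table:", n_rows, "\n")
cat("data found in a script:", has_data_var, "\n")
```

If the row count is zero while a script holds the data, the table is most likely being built client-side, which would explain why html_table() has nothing to read.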
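First, download the appropriate version of PhantomJS and put the executable (assuming Windows) in your working directory. That is, phantomjs.exe sits in the working directory of your R script.
Create a scrape.js file: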
// scrape.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'page.html';
page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
Once run, this scrape.js file creates a page.html file in your working directory. Back in R or RStudio, you can then do the following:
library(tidyverse)
library(rvest)
# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")
# Now we should be in business as usual:
read_html('page.html') %>%
  html_nodes("table#Tabc") %>%
  html_table(header = TRUE) %>%
  .[[1]] %>%
  as_tibble()
# A tibble: 504 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
# ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
# (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
# Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
# 09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
# generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
# rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
# min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
# suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
# regulacion secundaria ($/MW` <dbl>
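Next, to scrape several pages, create the lists to loop/walk/map over; the urls and paths lists appear after the comments further down (obviously, this could be cleaned up/abstracted to be more maintainable and to need less typing).
The following is not very elegant, but it should work without a headless browser: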
library(curl)
library(xml2)
url = "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
# ssl_verifypeer = FALSE skips certificate verification for this host
h <- new_handle(ssl_verifypeer = FALSE)
str_page <- rawToChar(curl_fetch_memory(url, h)$content)
xml_page <- read_html(str_page)

# The data lives in JavaScript variables inside <script> tags,
# so collect the script text and split it into statements
txt <- xml_text(xml_find_all(xml_page, "//script"))
txt <- unlist(strsplit(txt, ";", fixed = TRUE))
str(as.list(txt))

# Helper: drop double quotes and surrounding whitespace
clean <- function(x) trimws(gsub('"', "", x))

# Column names come from the vnctab array
cnames <- txt[grep("vnctab\\s*=", txt)]
cnames <- gsub("(^.*?\\[|\\]\\s*$)", "", cnames)
cnames <- clean(unlist(strsplit(cnames, ",")))

# Table rows come from the vdatrep array of arrays
tab <- txt[grep("vdatrep\\s*=", txt)]
substr(tab, 1, 1000)                      # inspect the start
substr(tab, nchar(tab)-1000, nchar(tab))  # inspect the end
tab <- gsub("^.*?\\[\\s*\\[", "", tab)
tab <- gsub("\\],*\\s*\\]$", "", tab)
tab_rows <- unlist(strsplit(tab, "\\]\\s*,*\\s*\\["))
tab <- strsplit(tab_rows, ",")
M <- do.call(rbind, lapply(tab, clean))

# The first two columns stay character, the rest are converted to numeric
d1 <- as.data.frame(M[,1:2], stringsAsFactors = FALSE)
d2 <- as.data.frame(apply(M[,-(1:2)], 2, as.double), stringsAsFactors = FALSE)
d <- cbind(d1, d2)
dim(d); length(cnames)
colnames(d) <- cnames
sapply(d, class)
str(d)
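The parsing steps above can be wrapped into a function and tried on a small string without touching the network. This is only a sketch under stated assumptions: `parse_script_table` is a hypothetical helper name, the `js` input is made up to mimic the shape of the `vnctab` and `vdatrep` assignments, and, like the code above, it splits on commas, so it would break if a field itself contained a comma.

```r
# Hypothetical helper: parse the vnctab / vdatrep arrays out of script text.
parse_script_table <- function(js_text) {
  clean <- function(x) trimws(gsub('"', "", x))
  stmts <- unlist(strsplit(js_text, ";", fixed = TRUE))

  # Column names: the contents of the vnctab array
  cnames <- stmts[grep("vnctab\\s*=", stmts)][1]
  cnames <- gsub("(^.*?\\[|\\]\\s*$)", "", cnames, perl = TRUE)
  cnames <- clean(unlist(strsplit(cnames, ",", fixed = TRUE)))

  # Rows: the contents of the vdatrep array of arrays
  tab <- stmts[grep("vdatrep\\s*=", stmts)][1]
  tab <- gsub("^.*?\\[\\s*\\[", "", tab, perl = TRUE)
  tab <- gsub("\\],*\\s*\\]\\s*$", "", tab, perl = TRUE)
  rows <- unlist(strsplit(tab, "\\]\\s*,*\\s*\\[", perl = TRUE))
  M <- do.call(rbind, lapply(strsplit(rows, ",", fixed = TRUE), clean))

  # First two columns stay character, the rest become numeric
  d_chr <- as.data.frame(M[, 1:2, drop = FALSE], stringsAsFactors = FALSE)
  d_num <- as.data.frame(matrix(as.double(M[, -(1:2)]), nrow = nrow(M)))
  d <- cbind(d_chr, d_num)
  colnames(d) <- cnames
  d
}

# Made-up input in the same shape as the page's script (an assumption)
js <- paste0(
  'var vnctab = ["Codigo", "Estatus", "Hora", "Limite"]; ',
  'var vdatrep = [["A1", "ECO", 1, 35.5], ["A1", "ECO", 2, 20]];'
)
d <- parse_script_table(js)
str(d)
```

With a function like this, looping over many dates would only need the script text of each page, for example the `txt` vector collapsed back into one string.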
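Comment: This works well, with just one problem. My main goal is to create a loop that scrapes across many dates, but since the URL and date parameters are inputs on the PhantomJS side rather than in R, I was wondering whether it is possible to create something like a dummy url and html.page in PhantomJS and then supply the correct URL from R.
Comment: @Garcher Ah, I see. I am sure there is a way to pass arguments through the system call, I just cannot think of it off the top of my head. It might be worth asking a separate question about PhantomJS. I will take a few minutes to see whether I can figure it out. Do you have some dates/URLs that you could add to your question as examples?
Comment: Great, I will take your suggestion. Perhaps these examples will help.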
urls <- list(
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)
paths <- list(
'page1.html',
'page2.html',
'page3.html'
)
args_list <- map2(urls, paths, paste)
# We are only using this function for the file creation side-effects,
# so we can use walk instead of map.
# This creates the files: page1.html, page2.html, and page3.html
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
read_page <- function(page) {
  read_html(page) %>%
    html_nodes("table#Tabc") %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    as_tibble()
}
paths %>%
  map(~ read_page(.)) %>%
  bind_rows()
# A tibble: 9,000 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
# ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
# (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
# Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
# 09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
# generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
# rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
# min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
# suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
# regulacion secundaria ($/MW` <dbl>
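The loop above calls scrape2.js, but that file is not shown in this thread. The sketch below is one guess at what it could look like, assuming it reads the URL and the output path from the command line through PhantomJS's `system.args`. The file contents, the `scrape2_js` variable and the `render_page` helper are all hypothetical. The R code writes the script to the working directory and builds a quoted `system2()` call; if no PhantomJS executable is found, it only reports the command it would have run.

```r
# Hypothetical scrape2.js: the thread calls it but never shows it. This version
# assumes the URL and the output path arrive as command-line arguments.
scrape2_js <- c(
  "// scrape2.js (assumed contents)",
  "var system = require('system');",
  "var fs = require('fs');",
  "var page = require('webpage').create();",
  "var url = system.args[1];   // first argument after the script name",
  "var path = system.args[2];  // second argument: output file",
  "page.open(url, function (status) {",
  "  if (status === 'success') {",
  "    fs.write(path, page.content, 'w');",
  "  }",
  "  phantom.exit(status === 'success' ? 0 : 1);",
  "});"
)
writeLines(scrape2_js, "scrape2.js")

# Hypothetical helper: render one URL to one file via PhantomJS.
render_page <- function(url, path, phantomjs = "./phantomjs") {
  # Quote the arguments so the shell passes them through unchanged
  args <- c("scrape2.js", shQuote(url), shQuote(path))
  if (!file.exists(phantomjs) && !nzchar(Sys.which(phantomjs))) {
    message("PhantomJS not found; would run: ",
            paste(c(phantomjs, args), collapse = " "))
    return(invisible(NA_integer_))
  }
  system2(phantomjs, args)  # returns the exit status of PhantomJS
}

status <- render_page("https://example.com/some%20page.html", "page1.html",
                      phantomjs = "./no-such-phantomjs")
```

With the lists defined earlier, something like `purrr::walk2(urls, paths, render_page)` would then create one file per URL, and a non-zero status would flag pages that failed to load.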