R Selenium: looping and downloading CSV files

r loops web-scraping

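I am trying to use RSelenium (together with Docker) to extract data from this website:

#-- Load packages
library(RSelenium)
library(rvest)
library(xml2)
library(tidyverse)
#-- Remote driver
remDr

My Spanish is a bit rusty, but if I am not mistaken, you are trying to first toggle los filtros de búsqueda por Sector e Institución and then loop over the Sector x Institución combinations.

If you click on one of the combinations, e.g. Aportaciones de Seguridad Social x Fondo de la Vivienda del ISSSTE, you can observe the following network request: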
method GET
url "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"
Headers:
Host: dgti-ejz-mspadronserpub.200.34.175.120.nip.io
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0
Accept: application/json
Accept-Language: de,en-US;q=0.7,en;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://nominatransparente.rhnet.gob.mx/
Origin: https://nominatransparente.rhnet.gob.mx
Connection: keep-alive
TE: Trailers

The response is a JSON document containing the relevant data, and we can make the same request from R using httr:
# Make the request
headers <- c(
    "Host" = "dgti-ejz-mspadronserpub.200.34.175.120.nip.io",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0",
    "Accept" = "application/json",
    "Referer" = "https://nominatransparente.rhnet.gob.mx",
    "Origin" = "https://nominatransparente.rhnet.gob.mx",
    "Connection" = "keep-alive",
    "TE" = "Trailers"
)
url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/19/HC6/1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"

response <- httr::GET(url, httr::add_headers(.headers = headers))
# Extract the data
data <- httr::content(response)
# Example, the first entry
data$listDtoServidorPublico[[1]]
# $nombres
# [1] "JOSE OSCAR"
# 
# $primerApellido
# [1] "ABURTO"
# 
# $segundoApellido
# [1] "LOPEZ"
# 
# $dependencia
# [1] "FONDO DE LA VIVIENDA DEL ISSSTE"
# 
# $tipoEntidad
# [1] "ORGANISMO DESCENTRALIZADO"
# 
# $nombrePuesto
# [1] "JEFE DE AREA PROF B EN PROC HIPOTEC FOVISSSTE"
# 
# $sueldoBase
# [1] 9432
# 
# $compensacionGarantizada
# [1] 2096

Edit: So I was wrong about which part of the URL has to be adjusted. If you compare the two links, you can see where they differ:
 url_1 = x + 19/HC6 + y
 url_2 = x + 25/C00 + y
 # where
 x = https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/
 y = /1/100?query=nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada

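So it seems that each Sector x Institución combination is encoded as VW/XYZ. If you retrieve all of these codes, you can iterate over the combinations.

Finally, if you inspect the network traffic further, you will probably find a request that contains the mapping for these codes.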
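The URL pattern can be sketched as a small builder function. `build_url()` is a hypothetical helper, and treating the two path segments after the institution code as page number and page size is only an assumption based on the `/1/100` in the observed request; the site does not document it.

```r
# Base URL and query string copied from the observed network request.
base_url <- "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector"
query <- paste(
  "nombres", "primerApellido", "segundoApellido", "dependencia",
  "tipoEntidad", "nombrePuesto", "sueldoBase", "compensacionGarantizada",
  sep = ","
)

# Hypothetical helper: assemble the request URL for one Sector x Institución pair.
# Assumption: the trailing "1/100" means page number / page size.
build_url <- function(sector, institucion, page = 1, page_size = 100) {
  sprintf("%s/%s/%s/%d/%d?query=%s",
          base_url, sector, institucion, page, page_size, query)
}

# The two combinations compared above (19/HC6 and 25/C00).
combos <- data.frame(
  sector = c("19", "25"),
  institucion = c("HC6", "C00"),
  stringsAsFactors = FALSE
)
urls <- mapply(build_url, combos$sector, combos$institucion, USE.NAMES = FALSE)
```

Each element of `urls` can then be passed to `httr::GET()` exactly like the single URL in the first example.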
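Edit 2
As suspected, while inspecting the network traffic I came across a request labeled sectores.json, with the request URL https://nominatransparente.rhnet.gob.mx/assets/sectores.json. It contains at least the mapping for the Sector part I was referring to. Digging further will probably turn up something similar for the Institución part.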

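You may need to toggle and click a given Sector and then look at all the Institución options offered for that Sector. In the DOM you will then see a similar mapping. I suggest: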
1. Get the sector mapping
2. Find out in the network traffic how the list of instituciones is returned. Probably something like:
-> Request containing the sector ID in the URL -> returns a JSON with all instituciones
3. Once you figure out the logic behind it, use httr::GET to create a list of all Sector x Institución combinations
4. Once you have this list, iterate over all combinations to get JSON data as above.
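The steps above could look roughly like the sketch below. All function names are hypothetical helpers, the layout of sectores.json is unknown to me (inspect it with `str()` before relying on it), and paging by incrementing the fourth path segment until an empty list comes back is an assumption that should be checked against the real network traffic.

```r
# Step 1 (sketch): download the sector mapping. Its structure is not known here.
get_sector_mapping <- function() {
  response <- httr::GET("https://nominatransparente.rhnet.gob.mx/assets/sectores.json")
  httr::stop_for_status(response)
  httr::content(response, as = "parsed", type = "application/json")
}

# Step 4 (sketch): fetch one page of records for one Sector x Institución pair.
# Assumption: ".../<page>/<page_size>" is how the API paginates.
fetch_sector_page <- function(sector, institucion, page, page_size = 100) {
  url <- sprintf(
    "https://dgti-ejz-mspadronserpub.200.34.175.120.nip.io/ms/InfoPadron/servidoresPublicosSector/%s/%s/%d/%d?query=%s",
    sector, institucion, page, page_size,
    "nombres,primerApellido,segundoApellido,dependencia,tipoEntidad,nombrePuesto,sueldoBase,compensacionGarantizada"
  )
  response <- httr::GET(url, httr::add_headers(
    Accept  = "application/json",
    Referer = "https://nominatransparente.rhnet.gob.mx/",
    Origin  = "https://nominatransparente.rhnet.gob.mx"
  ))
  httr::stop_for_status(response)
  httr::content(response)$listDtoServidorPublico
}

# Keep requesting pages until one comes back empty (or max_pages is reached).
# `fetch_page` is any function that takes a page number and returns a list.
collect_pages <- function(fetch_page, max_pages = 50) {
  records <- list()
  for (page in seq_len(max_pages)) {
    batch <- fetch_page(page)
    if (length(batch) == 0) break
    records <- c(records, batch)
  }
  records
}

# Example call (performs real HTTP requests, so it is left commented out):
# records <- collect_pages(function(p) fetch_sector_page("19", "HC6", p))
```

Passing the fetcher in as a function keeps the paging loop independent of the HTTP call, so the loop can be tried out with a stand-in function before hitting the server.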

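Hi Niko. Thank you for your answer. This seems like a good approach. One thing I noticed: this only retrieves the first page of the iteration, about 100 records (e.g. for the first Sector and the first Institución). May I ask how you got all that information from the url, i.e. line 11 of your code?

Hey @niko. I saw your edit; it is very clear and very helpful. For now, what I am doing is getting each sector and institution by hand and retrieving the data. This is more intuitive than what I was doing with Selenium. Also, if I may ask: since it retrieves a JSON list, what would be the right way to convert it to a data frame? I would like to keep all the retrieved data in a single data frame.

@MaximilianoRodriguez do.call(rbind, data$listDtoServidorPublico) should do this for the example above.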
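As a sketch of that conversion: `do.call(rbind, ...)` applied directly to a list of lists returns a matrix whose cells are lists, so converting each record to a one-row data frame first gives ordinary columns. `records_to_df()` is a hypothetical helper; it assumes every record has the same fields, and the second record below is a made-up placeholder, not real data.

```r
# Hypothetical helper: turn a list of parsed JSON records into one data frame.
# Assumption: all records share the same field names.
records_to_df <- function(records) {
  rows <- lapply(records, function(rec) {
    # JSON null is parsed as NULL, which would drop the column; use NA instead.
    rec[vapply(rec, is.null, logical(1))] <- NA
    as.data.frame(rec, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

records <- list(
  # First entry from the example response above.
  list(nombres = "JOSE OSCAR", primerApellido = "ABURTO",
       segundoApellido = "LOPEZ", sueldoBase = 9432,
       compensacionGarantizada = 2096),
  # Placeholder record (not real data) with a missing value.
  list(nombres = "NOMBRE", primerApellido = "APELLIDO1",
       segundoApellido = "APELLIDO2", sueldoBase = NULL,
       compensacionGarantizada = 0)
)

df <- records_to_df(records)
```

With real data the call would be `records_to_df(data$listDtoServidorPublico)`; the results for several Sector x Institución combinations can then be stacked with another `do.call(rbind, ...)`.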