Scraping a government information website with R


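I am scraping a Canadian federal website for a research project on online petitions. This is the whole website:

For each petition I need the following information: the hyperlink to the petition, the petition number, title, issue(s), petitioner(s), date received, status, and summary.

For example, Aboriginal affairs []. I started with the following code, but I got stuck after finding the title with //h1.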

 library("rvest")
 library("tm")
 # tm -> making a corpus and saving it
 library("lubridate")

 BASE <- "http://www.oag-bvg.gc.ca/internet/English/pet_lp_e_940.html"
 url <- paste0(BASE, 'http://www.oag-bvg.gc.ca/internet/English/pet_lpf_e_38167.html')
 page <- html(url)
 paras <- html_text(html_nodes(page, xpath='//p'))

 text <- paste(paras, collapse =' ')

 getdata <- function(url){ 
 page <- html(url)
 title <- html_text(html_node(page, xpath='//h1'))

 # The following code is just a copy-paste of a code someone gave me.

 list(title=tit, 
   date=parse_date_time(date, "%B %d, %Y"), 
   text=paste(text, collapse=' '))
 }


 index <- html(paste0(BASE, "index.html"))
 links <- html_nodes(index, xpath='//ul/li/a')

 texts <- c() 
 authors <- c()
 dates <- c()
 for (s in slinks){
 page <- paste0(BASE, s)
 cat('.') ## progress
 d <- getdata(page)
 texts <- append(texts, d$text)
 authors <- append(authors, d$author)
 dates <- append(dates, d$date)
 }
library(XML)
library(rvest)
# Please use this code only if the website allows you to scrape it.
# Get all HTML links on the home page related to online petitions.
kk <- getHTMLLinks("http://www.oag-bvg.gc.ca/internet/English/pet_lp_e_940.html")
# Iterate over each issue page (pattern pet_lpf_e) and get all the links listed on it.
dd <- lapply(grep("pet_lpf_e", kk, value = TRUE), function(x) {
  paste0("http://www.oag-bvg.gc.ca", x) %>%
    getHTMLLinks
})
# Keep only the links to individual petitions (pet_ + three digits + _e).
# unlist() avoids the recycling that rbind() does on vectors of unequal length;
# unique() drops petitions that are listed under more than one issue.
ee <- unique(grep("pet_[0-9]{3}_e", unlist(dd), value = TRUE))
# Iterate over ee and get the details for each petition.
ff <- lapply(ee, function(y) {
  paste0("http://www.oag-bvg.gc.ca", y) %>%
    read_html() %>%              # html() is deprecated in newer rvest versions
    html_nodes("h1, p") %>%      # h1 is the title, p the paragraphs
    html_text() %>%
    .[1:7] %>%
    cbind(., link = paste0("http://www.oag-bvg.gc.ca", y))
})
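
The two `grep()` filters used above can be tried offline on a small vector of links. In this sketch the second href is made up for illustration and the others reuse paths that appear in this post, so it only demonstrates the pattern matching, not the site's real link list.

```r
# Hypothetical hrefs, standing in for what getHTMLLinks() might return
links <- c(
  "/internet/English/pet_lpf_e_38167.html",  # issue listing page
  "/internet/English/admin_e_41.html",       # unrelated page (made up)
  "/internet/English/pet_362_e_39682.html",  # individual petition
  "/internet/English/pet_362_e_39682.html"   # same petition linked twice
)

# Issue listing pages contain "pet_lpf_e"
issue_pages <- grep("pet_lpf_e", links, value = TRUE)

# Petition pages contain "pet_", exactly three digits, then "_e"
petition_pages <- unique(grep("pet_[0-9]{3}_e", links, value = TRUE))

# Turn the relative paths into full URLs
petition_urls <- paste0("http://www.oag-bvg.gc.ca", petition_pages)

print(issue_pages)
print(petition_urls)
```

Note that the second pattern does not match the listing page, because `lpf` is not a run of digits.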

For example, the first element of ff:

    > ff[[1]]

    [1,] "Federal role and action in response to the Obed Mountain Mine coal slurry spill into the Athabasca River watershed"
    [2,] "Petition: 362 "
    [3,] "Issue(s): Aboriginal affairs, compliance and enforcement, human/environmental health, toxic substances, water"
    [4,] "Petitioner(s): Keepers of the Athabasca Watershed Society and Ecojustice"
    [5,] "Date Received: 24 March 2014"
    [6,] "Status: Completed"
    [7,] "Summary: The petition raises concerns about the federal government’s role and actions in response to the October 2013 Obed Mountain Mine coal slurry spill into the Athabasca River watershed. The petition summarizes the events surrounding the spill, and includes information about the toxic substances that may have been contained in the slurry, such as polycyclic aromatic hydrocarbons, arsenic, cadmium, lead, and mercury. According to the petition, about 670 million litres of slurry were released into the environment; the spill had an impact on fish habitat in nearby streams; and the plume may have travelled far downstream and had a potential impact on municipal drinking water. The petitioners ask the government about its approvals and inspections prior to the spill, as well as its response to the spill, including investigations, future monitoring, and habitat remediation. "
         link                                                            
    [1,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [2,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [3,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [4,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [5,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [6,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [7,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
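
Since rows 2 to 7 each carry a `Label: value` string, one possible next step is to split the labels from the values and turn each petition into a one-row data frame. The following is a minimal sketch under that assumption; `parse_petition` is a hypothetical helper (not part of rvest or XML), and the sample matrix is copied from the output above with the summary shortened.

```r
# Hypothetical helper: convert one 7-row matrix (as in the output above)
# into a one-row data frame.
# Assumes row 1 is the title and rows 2-7 look like "Label: value".
parse_petition <- function(m) {
  fields <- trimws(m[-1, 1])
  labels <- sub(":.*$", "", fields)             # text before the first colon
  values <- trimws(sub("^[^:]*:", "", fields))  # text after the first colon
  labels <- gsub("\\(s\\)", "", labels)         # "Issue(s)" -> "Issue"
  labels <- gsub(" ", "_", labels)              # "Date Received" -> "Date_Received"
  row <- c(Title = unname(m[1, 1]),
           setNames(values, labels),
           Link = unname(m[1, "link"]))
  as.data.frame(as.list(row), stringsAsFactors = FALSE)
}

# Sample input built by hand from the output shown above
m <- cbind(
  c("Federal role and action in response to the Obed Mountain Mine coal slurry spill into the Athabasca River watershed",
    "Petition: 362 ",
    "Issue(s): Aboriginal affairs, compliance and enforcement, human/environmental health, toxic substances, water",
    "Petitioner(s): Keepers of the Athabasca Watershed Society and Ecojustice",
    "Date Received: 24 March 2014",
    "Status: Completed",
    "Summary: (shortened for this example)"),
  link = "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
)

petition <- parse_petition(m)
str(petition)
```

If every petition page follows this layout, `do.call(rbind, lapply(ff, parse_petition))` would stack the rows into a single table. Pages with missing or extra paragraphs would need to be handled separately.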