Scraping a table from a website in R

r web-scraping

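I am trying to extract a table from the following link using R:

I tried the following:

url <- "https://pubchem.ncbi.nlm.nih.gov/compound/1983#section=DrugBank-Interactions&fullscreen=true"
require(XML)
url.table <- readHTMLTable(url, which = 1, header = FALSE, stringsAsFactors = FALSE)

I am not very familiar with web scraping. Is there a way to get the table at the link above into R? Also, how can I tell what format the data is stored in (XML, JSON, etc.)?

Thanks.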

Here is an RSelenium approach that works around the JavaScript problem:

library(RSelenium)
library(rvest)

#this sets up the phantomjs driver
pjs <- wdman::phantomjs()

#open a connection to it
dr <- rsDriver(browser = 'phantomjs')
remdr <- dr[['client']]

#go to the site
remdr$navigate("https://pubchem.ncbi.nlm.nih.gov/compound/1983#section=DrugBank-Interactions&fullscreen=true")

#get tables
tables <- remdr$findElements('class', 'table-container')

tableList <- list()
for(i in seq_along(tables)){  # seq_along() skips the loop cleanly if no tables were found
  x <- tables[[i]]$getElementAttribute('innerHTML') %>%
    unlist() %>%
    read_html() %>%
    html_table()

  tableList[[i]] <- x[[1]]
}

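As others have pointed out, the problem is that the data is loaded with JavaScript rather than being present in the HTML, so you need a tool that executes the JS to get at the information. Ian shows in his answer how to use RSelenium, which drives a browser on your machine to do the job. In this case there is another approach that does not require Selenium.
Using Chrome (other browsers can probably do this too), you can open the developer tools and look at the browser's network activity. If you load the link above while this tab is open, you can see all of the background activity the page takes part in. This matters because JavaScript does not make data appear by magic; it fetches it from somewhere. This tab lets us see where the data comes from.
When I do this, this is what I see: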

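The next step takes some detective work: we need to find the step that loads the data. This is often in JSON format. Most of the way down the activity list we see two JSON steps, one for an index and one for the data. We can right-click either one and open it in a new tab.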
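If it is not obvious from the network tab which format a response uses, one rough check is to try parsing the body. The sketch below is only an illustration: guess_format is a hypothetical helper (not part of any package), and the sample strings are made up rather than taken from PubChem. It relies on jsonlite::validate and xml2::read_xml.

```r
library(jsonlite)
library(xml2)

# Hypothetical helper: guess whether a response body is JSON or XML
# by attempting to parse it with each parser in turn.
guess_format <- function(txt) {
  # validate() returns TRUE when the text is syntactically valid JSON
  if (isTRUE(as.logical(validate(txt)))) return("json")
  # read_xml() raises an error on input that is not well-formed XML
  parsed <- try(read_xml(charToRaw(txt)), silent = TRUE)
  if (!inherits(parsed, "try-error")) return("xml")
  "unknown"
}

# Made-up sample bodies, for illustration only
guess_format('{"a": 1}')
guess_format('<a><b>1</b></a>')
guess_format('plain text')
```

In practice the Content-Type header shown in the network tab is usually the quicker signal; parsing is just a fallback.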

The data request contains (at first glance) all of the data in the table. We can now read this link into R and extract the table:
library(httr)
library(jsonlite)
library(magrittr)

# download the JSON behind the page and parse it into nested R lists / data frames
json = GET("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/1983/JSON/?") %>%
  content(as = 'text', encoding = 'UTF-8') %>%
  fromJSON()

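GET is the HTTP verb for retrieving data from a website and comes from the httr package. content extracts the result of GET as text, and fromJSON (from the jsonlite package) converts it into a list in R. We now have a large list that we can navigate to find the data: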
json$Record$Section$TOCHeading

 [1] "2D Structure"                           "3D Conformer"                          
 [3] "LCSS"                                   "Names and Identifiers"                 
 [5] "Chemical and Physical Properties"       "Related Records"                       
 [7] "Chemical Vendors"                       "Drug and Medication Information"       
 [9] "Agrochemical Information"               "Pharmacology and Biochemistry"         
[11] "Use and Manufacturing"                  "Identification"                        
[13] "Safety and Hazards"                     "Toxicity"                              
[15] "Literature"                             "Patents"                               
[17] "Biomolecular Interactions and Pathways" "Biological Test Results"               
[19] "Classification"  

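The data you are looking for is in "Biomolecular Interactions and Pathways" (the 17th element), which leads to another data frame. Its third row is the one for DrugBank Interactions: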
json$Record$Section$Section[[17]]$TOCHeading
[1] "Protein Bound 3-D Structures" "Biosystems and Pathways"      "DrugBank Interactions" 

This gives a data.frame with a single column, in which each row is a list of length 1 containing a data frame:
dbi = json$Record$Section$Section[[17]]$Information[[3]]$Table
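The positions used above (17 and 3) reflect the response as it looked when this was written, so they may shift if the record changes. One hedged alternative is to look sections up by their TOCHeading instead. The sketch below runs on a made-up, heavily trimmed stand-in for the response (mock), and find_section is a hypothetical helper. It parses with simplifyVector = FALSE, so everything is a plain nested list rather than the data frames used in the rest of this answer.

```r
library(jsonlite)

# Made-up, trimmed stand-in for the real response; only the headings matter here
mock <- '{
  "Record": {
    "Section": [
      {"TOCHeading": "Literature"},
      {"TOCHeading": "Biomolecular Interactions and Pathways",
       "Section": [
         {"TOCHeading": "Biosystems and Pathways"},
         {"TOCHeading": "DrugBank Interactions"}
       ]}
    ]
  }
}'

# Hypothetical helper: return the first section whose heading matches
find_section <- function(sections, heading) {
  hits <- Filter(function(s) identical(s$TOCHeading, heading), sections)
  if (length(hits) == 0) stop("No section named: ", heading)
  hits[[1]]
}

# simplifyVector = FALSE keeps the result as nested lists (no data frames)
parsed <- fromJSON(mock, simplifyVector = FALSE)
bio <- find_section(parsed$Record$Section, "Biomolecular Interactions and Pathways")
dbi_section <- find_section(bio$Section, "DrugBank Interactions")
dbi_section$TOCHeading
```

The trade-off is that the nested-list form needs different indexing further down than the data.frame form shown above.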

We can write a function and use a few lapply calls to extract the tables:
# pull the key/value pair out of row i of the d-th nested table in dbi
extractValues = function(i, d){
  sv = dbi[d,][[1]][i,][[1]]$StringValue
  out = data.frame(Key = sv[1], Value = sv[2], stringsAsFactors = FALSE)
  return(out)
}

# build one key/value data frame per nested table
dbi_Tables = lapply(seq_len(nrow(dbi)), function(d){
  out = lapply(seq_len(nrow(dbi[d,][[1]])), extractValues, d = d) %>%
    do.call(rbind, .)
  return(out)
})

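You now have a list of key/value tables.
Admittedly, this is a lot more involved than a nice rvest call, and this JSON is very messy, but as a strategy for dealing with JS-loaded data it can be much faster and less brittle than RSelenium.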
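For anyone unfamiliar with the lapply plus do.call(rbind, ...) pattern used above, here is a minimal sketch of it on made-up data. The rows list and its values are invented for illustration and do not come from PubChem; to_row is a hypothetical helper.

```r
# Made-up rows, each holding a two-element StringValue (key, then value)
rows <- list(
  list(StringValue = c("key one", "value one")),
  list(StringValue = c("key two", "value two"))
)

# Hypothetical helper: turn one row into a one-line data frame
to_row <- function(r) {
  data.frame(Key = r$StringValue[1], Value = r$StringValue[2],
             stringsAsFactors = FALSE)
}

# lapply() builds a list of one-row data frames; do.call(rbind, ...) stacks them
tbl <- do.call(rbind, lapply(rows, to_row))
tbl
```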
I'm not an expert, but I think this is because your table is not contained in the page source; it is loaded via JS. You might have more luck getting the tables from the original source, since that one seems to contain all of the data in its HTML source. Also, your readHTMLTable call has url.2, but you have url at the top of your example.
Wow, this is a great approach! @IanKloo I do a lot of this at work and (often) run into this JS problem.