Scraping multiple web tables with R and RCurl

r for-loop web-scraping

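First of all, thanks in advance for your replies.

I need to build a single table by joining several smaller tables, each of which sits on its own web page. So far I have been able to extract the information, but I have not managed to automate the extraction with a loop. My commands so far are: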

library(RCurl)
library(XML)
# index <- toupper(letters)
# EDIT:
index <- LETTERS

index[1] <- "0-A"
url <- paste("www.citefactor.org/journal-impact-factor-list-2014_", index, ".html", sep="", collapse=";")
urls <- strsplit(url, ";") [[1]]

I get the following error:

Error in data.frame(`Search Journal Impact Factor List 2014` = list(`0-A` = "N") : arguments imply differing number of rows: 1110447874169486201189172837….

However, if urls has only one element, the function works:

tabA <- read.html.tab(urls[1])
tabB <- read.html.tab(urls[2]) 
tab.if <- rbind(tabA,tabB)

ifacs <- tab.if[,27:ncol(tab.if)]
View(ifacs)

You can drop the for loop entirely and do the following:
Data <- lapply(urls, function(x){
  readHTMLTable(
    getURL(x),
    stringsAsFactors=F)[[2]]
})

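I'm not sure whether you want to combine everything into a single object, but if so, you can use do.call(rbind, Data). Also, each of these URLs appears to return two tables, the first being the search index at the top of the page, which is why I used

readHTMLTable(
    getURL(x),
    stringsAsFactors=F)[[2]]

inside lapply, rather than

readHTMLTable(
        getURL(x),
        stringsAsFactors=F)

The latter returns a list of two tables for each URL: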
R> head(url1[[1]])
  0-A &nbsp| B &nbsp| C &nbsp| D &nbsp| E &nbsp| F &nbsp| G &nbsp| H &nbsp| I &nbsp| J &nbsp| K &nbsp| L &nbsp| M &nbsp|
1   N &nbsp| O &nbsp| P &nbsp| Q &nbsp| R &nbsp| S &nbsp| T &nbsp| U &nbsp| V &nbsp| W &nbsp| X &nbsp| Y &nbsp| Z &nbsp|
##
R> head(url1[[2]])
  INDEX                                        JOURNAL      ISSN 2013/2014  2012  2011  2010  2009  2008
1     1 4OR-A Quarterly Journal of Operations Research 1619-4500     0.918  0.73 0.323  0.69  0.75     -
2     2                                  Aaohn Journal 0891-0162     0.608 0.856 0.509  0.56     -     -
3     3                                  Aapg Bulletin 0149-1423     1.832 1.768 1.831 1.964 1.448 1.364
4     4                                   AAPS Journal 1550-7416     3.905 4.386 5.086 3.942  3.54     -
5     5                              Aaps Pharmscitech 1530-9932     1.776 1.584 1.432 1.211  1.19 1.445
6     6                                   Aatcc Review 1532-8813     0.254 0.354 0.139 0.315 0.293 0.352
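
The pattern above (keep the second table from each page, then stack the results) can be sketched without touching the network. This is only an illustration: fake_read_tables is a hypothetical stand-in for readHTMLTable(getURL(x)), and the journal names are made up.

```r
# Hypothetical stand-in for readHTMLTable(getURL(x)): returns a list of two
# tables, a navigation table first and the data table second.
fake_read_tables <- function(page) {
  list(
    data.frame(nav = "0-A | B | C", stringsAsFactors = FALSE),
    data.frame(INDEX = 1:2,
               JOURNAL = paste(page, c("Journal One", "Journal Two")),
               stringsAsFactors = FALSE)
  )
}

pages <- c("0-A", "B", "C")

# Keep only the second table from each page.
Data <- lapply(pages, function(x) fake_read_tables(x)[[2]])

# Stack the per-page tables into one data frame.
combined <- do.call(rbind, Data)
```

Because every per-page table has the same columns, rbind can stack them even though the real pages hold different numbers of rows.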

Obligatory Hadleyverse answer:
library(rvest)
library(dplyr)
library(magrittr)
library(pbapply)

urls <- sprintf("http://www.citefactor.org/journal-impact-factor-list-2014_%s.html", 
                c("0-A", LETTERS[-1]))

dat <- urls %>%
  pblapply(function(url) 
    html(url) %>% html_table(header=TRUE) %>% extract2(2)) %>%
  bind_rows()

glimpse(dat)

## Observations: 1547
## Variables:
## $ INDEX     (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
## $ JOURNAL   (chr) "4OR-A Quarterly Journal of Operations Researc...
## $ ISSN      (chr) "1619-4500", "0891-0162", "0149-1423", "1550-7...
## $ 2013/2014 (chr) "0.918", "0.608", "1.832", "3.905", "1.776", "...
## $ 2012      (chr) "0.73", "0.856", "1.768", "4.386", "1.584", "0...
## $ 2011      (chr) "0.323", "0.509", "1.831", "5.086", "1.432", "...
## $ 2010      (chr) "0.69", "0.56", "1.964", "3.942", "1.211", "0....
## $ 2009      (chr) "0.75", "-", "1.448", "3.54", "1.19", "0.293",...
## $ 2008      (chr) "-", "-", "1.364", "-", "1.445", "0.352", "1.4...

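rvest gives us html and html_table.

I use magrittr solely for extract2, which just wraps [[ and reads better (IMO).

The pbapply package wraps the *apply functions and gives you progress bars for free.

NOTE: bind_rows is in the latest dplyr, so grab that before using it.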
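
A note for readers on newer rvest releases: to my knowledge html() has since been superseded by read_html(), so the parsing step may need that name instead. The sketch below is not part of the original answer. It parses a small inline HTML string (a made-up page with a navigation table followed by a data table) rather than the live site, so the page variable and its contents are assumptions for illustration only.

```r
library(rvest)

# Hypothetical page: a navigation table first, then the data table.
page <- '<html><body>
<table>
  <tr><td>0-A</td><td>B</td></tr>
  <tr><td>N</td><td>O</td></tr>
</table>
<table>
  <tr><th>INDEX</th><th>JOURNAL</th></tr>
  <tr><td>1</td><td>Aapg Bulletin</td></tr>
  <tr><td>2</td><td>AAPS Journal</td></tr>
</table>
</body></html>'

# read_html() parses the document; html_table() returns one table per <table>.
tables <- html_table(read_html(page), header = TRUE)

# The second element is the data table, as with extract2(2) above.
journals <- tables[[2]]
```

Swapping the inline string for one of the URLs should behave the same way, assuming the site still serves two tables per page.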
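Thank you very much! It works! I have been studying your use of lapply, which belongs to a family of functions I sometimes struggle with. Just one question: how did you get url1?

You're welcome; url1 is just a temporary object I used to show what readHTMLTable(getURL(x)) returns without the [[2]].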