Using lists in foreach in R

r list parallel-processing

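I am trying to parallelize the extraction of data saved in some HTML documents and store it in data.frames. There are several million documents, hence the usefulness of parallelization.

In a first step, on the machine where I register the queue, I select a subset of the HTML files and lapply the read_html function from the rvest package to them, to obtain a single list storing the content of many HTML pages. (I also tried the similar function from the XML package, but I ran into memory leak problems.)

I then use an iterator on this list to obtain smaller chunks of it, which are fed to foreach.

Inside the foreach I build the data.frames using the html_table function and some basic data manipulation, and I return a list whose elements are the cleaned data.frames.

I have tried both the doSNOW backend on Windows 8 and the doRedis backend on Ubuntu 16.04.

In the first case a list of empty lists is returned, while in the second a memory mapping error is thrown; you can find the traceback at the very bottom of the question.

As far as I understand, the chunks of the list that I send to the cores do not behave as I expect. I have gathered that a list object may be just a set of pointers, but I have not been able to confirm it; maybe this is the issue? Is there an alternative to the list approach for encapsulating the data of multiple HTML pages?

Below you can find some code reproducing the issue. I am completely new to Stack Overflow, new to parallel programming and fairly new to R programming: any advice for improvement is welcome. Thank you all in advance.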

library(rvest)
library(foreach)
library(iterators)  # provides idiv() and nextElem(), used by the custom iterator below

#wikipedia pages of olympic medalist between 1992 and 2016 are
# downloaded for reproducibility
for(i in seq(1992, 2016, by=4)){

  html = paste("https://en.wikipedia.org/wiki/List_of_", i, "_Summer_Olympics_medal_winners", sep="")
  con = url(html)
  htmlCode = readLines(con)
  writeLines(htmlCode, con=paste(i, "medalists", sep="_"))
  close(con)

}

#declaring the redis backend (doSNOW code is also included below)

#note that I am using the package from 
#devtools::install_github("bwlewis/doRedis") due to a "nodelay error"
#(more info on that here: https://github.com/bwlewis/doRedis/issues/24)
# if it is not your case please drop the nodelay and timeout options

#Registering cores ---Ubuntu---
cores=2
library('doRedis')
options('redis:num'=TRUE)
registerDoRedis("jobs", nodelay=FALSE)
startLocalWorkers(n=cores, "jobs", timeout=2, nodelay=FALSE)
foreachOpt <- list(preschedule=FALSE)


#Registering cores ---Win---
#cores=2
#library("doSNOW")
#registerDoSNOW(makeCluster(cores, type = "SOCK"))


#defining the iterator
iterator <- function(x, ...) {
  i <- 1
  it <- idiv(length(x), ...)

  if(exists("chunks")){
    nextEl <- function() {
      n <- nextElem(it)
      ix <- seq(i, length=n)
      i <<- i + n
      x[ix]
    }
  }else{
    nextEl <- function() {
      n <- nextElem(it)
      ix <- seq(i, i+n-1)
      i <<- i + n
      x[ix]
    }
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c(
    'ivector', 'abstractiter','iter')
  obj
}

#reading files (only the "<year>_medalists" files written above)
names_files<-list.files(pattern="_medalists$")
html_list<-lapply(names_files, read_html)

#creating iterator
ChunkSize_html_list<-2
iter<-iterator(html_list, chunkSize=ChunkSize_html_list)

#defining expanding list (thanks StackOverflow and many thanks to
#JanKanis's answer : http://stackoverflow.com/questions/2436688/append-an-object-to-a-list-in-r-in-amortized-constant-time-o1  )
expanding_list <- function(capacity = 10) {
  buffer <- vector('list', capacity)
  length <- 0

  methods <- list()

  methods$double.size <- function() {
    buffer <<- c(buffer, vector('list', capacity))
    capacity <<- capacity * 2
  }

  methods$add <- function(val) {
    if(length == capacity) {
      methods$double.size()
    }

    length <<- length + 1
    buffer[[length]] <<- val
  }

  methods$as.list <- function() {
    b <- buffer[0:length]
    return(b)
  }

  methods
}

#parallelized part
clean_data<-foreach(ite=iter, .packages=c("itertools", "rvest"), .combine=c,
 .options.multicore=foreachOpt, .options.redis=list(chunkSize=1)) %dopar% {

  temp_tot <- expanding_list()
  for(g in seq_along(ite)){

    #extraction of data from tables
    tables <- html_table(ite[[g]], fill=TRUE, header=TRUE)

    for(i in seq_along(tables)){

      #just some basic data manipulation
      temp<-lapply(tables, function(d){d[nrow(d),]})
      temp_tot$add(temp)
      rm(temp)
      gc(verbose = FALSE)
    }
  }
  #returning the list of cleaned data.frames to the foreach 
    temp_tot$as.list()
}

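I think the problem is that you are creating the XML/HTML document objects on the master by calling read_html, and then processing them on the workers. I tried some experiments, and it looks like that doesn't work, probably because those objects can't be serialized, sent to the workers, and then deserialized correctly. I think the objects are corrupted, causing the workers to fail when they try to operate on them with the html_table function.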
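One way to see why this may happen, as a minimal sketch assuming the xml2 package (whose read_html is what rvest exposes): a parsed document is a small R list wrapping external pointers into memory owned by libxml2, and plain text is what reliably survives a trip between processes. The round trip through as.character shown here is an illustrative workaround, not something taken from the original experiments.

```r
library(xml2)

doc <- read_xml("<foo>1</foo>")

# The parsed document holds external pointers to memory owned by libxml2.
print(typeof(doc$doc))
print(typeof(doc$node))

# Plain text, unlike the pointers, survives serialization between processes,
# so a document sent as text can be parsed again on the receiving side.
txt <- as.character(doc)
copy <- read_xml(unserialize(serialize(txt, NULL)))
print(xml_text(copy))
```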

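I suggest that you modify your code to iterate over the file names, so that the workers call read_html themselves and you avoid serializing the XML document objects.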

Here is some of the test code that I experimented with:

library(xml2)
library(snow)
cl <- makeSOCKcluster(3)
clusterEvalQ(cl, library(xml2))

# Create XML documents on the master
docs <- lapply(1:10,
      function(i) read_xml(paste0("<foo>", i, "</foo>")))

# Call xml_path on XML documents created on master
r1 <- lapply(docs, xml_path)            # correct results
r2 <- clusterApply(cl, docs, xml_path)  # incorrect results

# This seems to work...
docs2 <- clusterApply(cl, 1:10,
      function(i) read_xml(paste0("<foo>", i, "</foo>")))

# But this causes a segfault on the master
print(docs2)

I used the snow functions directly to verify that the problem is not in foreach or doSNOW.
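The suggestion of iterating over file names could look roughly like the following sketch with foreach and doSNOW. The two tiny HTML files and the name `pages_dir` are hypothetical stand-ins so that the example runs without downloading anything; in the question's setting `files` would be the downloaded medalist pages.

```r
library(foreach)
library(doSNOW)

# Hypothetical stand-in data: two tiny HTML files, each holding one table.
pages_dir <- tempfile("pages")
dir.create(pages_dir)
for (k in 1:2) {
  writeLines(
    sprintf("<html><body><table><tr><th>id</th></tr><tr><td>%d</td></tr></table></body></html>", k),
    file.path(pages_dir, paste0("page_", k, ".html"))
  )
}

cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

files <- list.files(pages_dir, full.names = TRUE)

# Only file names (plain character strings) cross the process boundary;
# each worker parses its own document and returns ordinary data.frames.
clean_data <- foreach(f = files, .packages = "rvest", .combine = c) %dopar% {
  tables <- html_table(read_html(f), header = TRUE)
  # keep the last row of every table on the page
  lapply(tables, function(d) d[nrow(d), , drop = FALSE])
}

stopCluster(cl)
print(length(clean_data))
```

Because each task returns a list of data.frames and the results are combined with c, the outcome is a flat list with one element per table rather than one per file.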

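Congratulations on your first question, and welcome to Stack Overflow. I think you are being too clever here. I don't see a reason for using a closure. Why do you need this expanding list? Apparently you know how large the lists need to be, so just pre-allocate them using vector(mode = "list", length = length(tables)).

First of all, to clarify the question: temp_tot collects the last row of all the tables (i-loop) of all the pages (g-loop), and the number of tables per page is unknown. I found that this can be solved by c()-ing, at the end of the g-loop, the list created in the i-loop with your code. A second objection, which makes me prefer expanding_list, is the nested structure produced by the c() approach: the first index is inherited from the g-index and the second from the i-index. expanding_list avoids this.

Steve, thank you very much for your time; your answer is really a major step forward, and I can now start cleaning. Using clusterApply(cl, 1:10, function(i) html_table(read_xml(paste0(..., i, ...)), fill = T)), with rvest exported to the cluster, returns an accessible list of data.frames! I also get an error with your docs2; maybe it is a problem with printing the XML? However, in the question I read the page contents on the master in order to allow other machines to join the job in the doRedis case; correct me if I am wrong, but that is not possible with your strategy.

@dgdi I don't think that moving the read_html call from the master to the workers fundamentally changes the way the work is scheduled, but perhaps I am missing something.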
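The pre-allocation suggested in the comments above, together with the concern about a nested result, could be sketched as follows. The `tables` list is a hypothetical stand-in for the output of html_table on one page, and flattening with unlist(recursive = FALSE) is one possible way to avoid the nesting, not something proposed in the thread itself.

```r
# Hypothetical stand-in for the output of html_table() on one page.
tables <- list(
  data.frame(a = 1:3),
  data.frame(a = 4:5)
)

# Pre-allocate one slot per table, then fill each slot with that table's last row.
last_rows <- vector(mode = "list", length = length(tables))
for (i in seq_along(tables)) {
  d <- tables[[i]]
  last_rows[[i]] <- d[nrow(d), , drop = FALSE]
}

# Collecting one such list per page gives a nested list (page index, then
# table index); unlist(recursive = FALSE) removes exactly one level of
# nesting and leaves the data.frames themselves intact.
per_page <- list(last_rows, last_rows)
flat <- unlist(per_page, recursive = FALSE)
print(length(flat))
```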
