R: Sharing objects across workers

r parallel-processing

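I want to run f(x) on many different workers, which run on the cores of one (or several) remote machine(s), where x is a large object. My interactive R session runs on node0 and I use the parallel library, so I do the following: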

library(parallel)

cl <- makeCluster(rep("node1", times = 64))
clusterExport(cl, "x")
clusterExport(cl, "f")

clusterEvalQ(cl, f(x))

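Assuming the connection between the main host and the remote host is the bottleneck, you can transfer a single copy to the first worker, have that worker cache it to a file, and let the remaining workers read the data from that cache file. For example: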
library("parallel")

## Large data object
x <- 1:1e6
f <- function(x) mean(x)

## All N=64 workers are on the same host
cl <- makeCluster(rep("node1", times = 64))

## Send function
clusterExport(cl, "f")

## Send data to first worker (over slow connection)
clusterExport(cl[1], "x")

## Save to cache file (on remote machine)
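cachefile <- clusterEvalQ(cl[1], {
  ## Use a name other than `f`, so the exported function f is not overwritten
  cachefile <- tempfile(fileext = ".rds")
  saveRDS(x, file = cachefile)
  cachefile
})[[1]]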

## Load cache file into remaining workers
clusterExport(cl[-1], "cachefile")
clusterEvalQ(cl[-1], { x <- readRDS(file = cachefile); TRUE })

# Resolve function on all workers
y <- clusterEvalQ(cl, f(x))
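
The same cache-file pattern can be tried out locally before pointing it at a remote host. The following is a minimal sketch, assuming two workers on the local machine stand in for `node1` (so nothing actually crosses a slow link); the names `same` and the `.rds` extension are illustrative choices, not part of the answer above. It also removes the cache file once every worker has loaded it.

```r
library(parallel)

## Two local workers stand in for the remote host (assumption for this sketch)
cl <- makeCluster(2)

x <- rnorm(1e5)                   # stand-in for the large object
f <- function(x) mean(x)
clusterExport(cl, "f", envir = environment())

## Ship the data to the first worker only
clusterExport(cl[1], "x", envir = environment())

## The first worker writes the cache file and reports its path
cachefile <- clusterEvalQ(cl[1], {
  cachefile <- tempfile(fileext = ".rds")
  saveRDS(x, file = cachefile)
  cachefile
})[[1]]

## The remaining workers read the object from the file on their own host
clusterExport(cl[-1], "cachefile", envir = environment())
invisible(clusterEvalQ(cl[-1], { x <- readRDS(cachefile); TRUE }))

## Every worker should now compute the same result as the main session
y <- clusterEvalQ(cl, f(x))
same <- all(vapply(y, function(v) isTRUE(all.equal(v, mean(x))), logical(1)))

## Remove the cache file and shut the cluster down
invisible(clusterEvalQ(cl[1], unlink(cachefile)))
stopCluster(cl)
```

Passing `envir = environment()` keeps `clusterExport()` working when the snippet is run inside a function rather than at top level. With workers on a real remote host, the cache file only helps if all of them can read the same path, which holds when they share one machine as in the question.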

Here is a version using a FIFO. I am not sure how portable it is outside Linux, and I am not sure how it compares to @HenrikB's answer in terms of performance:
library(parallel)

# create a very large cluster on a single (remote) node:
cl <- makePSOCKcluster(3)

# create a very large object
o <- 1:10

# create a fifo on the node and retrieve the name
fifo_name <- clusterEvalQ(cl[1], {
                        fifo_name <- tempfile()
                        system2("mkfifo", fifo_name)
                        fifo_name
})[[1]]

# send the very large object to one process on the node and the name of the fifo to all nodes
clusterExport(cl[1], "o")
clusterExport(cl, "fifo_name")

# does the actual sharing through the fifo
# note that a fifo has to be opened for reading 
# before writing on it
for(i in 2:length(cl)) {
  clusterEvalQ(cl[i], { ff <- fifo(fifo_name, "rb")  })
  clusterEvalQ(cl[1], { ff <- fifo(fifo_name, "wb")
                        saveRDS(o, ff)
                        close(ff)                    })
  clusterEvalQ(cl[i], { o <- readRDS(ff)
                        close(ff)                    })
}

# cleanup
clusterEvalQ(cl[1], {   unlink(fifo_name)            })

# check if everything is there
clusterEvalQ(cl, exists("o"))

# now you can do the actual work
...
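
Since portability is an open question for the FIFO approach, it may be worth checking each worker before relying on it. The sketch below uses a hypothetical helper, `fifo_supported()`, which is not part of the answer above; it only checks that R reports FIFO support and that an `mkfifo` executable is on the search path.

```r
## Hypothetical helper: TRUE when this R session looks able to run the
## FIFO approach (R built with fifo support and `mkfifo` found on the PATH)
fifo_supported <- function() {
  isTRUE(unname(capabilities("fifo"))) && nzchar(Sys.which("mkfifo"))
}

ok <- fifo_supported()
```

On a cluster, the same check could be run on every worker with `all(unlist(clusterCall(cl, fifo_supported)))`, falling back to the cache-file approach when it returns `FALSE`.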

Are you sure the object is sent on every iteration?

The object is sent to each worker, one after the other, in the line clusterExport(cl, "x"), which is very slow because it happens over the network connection. It would be enough to send it once and then copy it in memory from one worker to another.

That works, of course. I was hoping for some pipe/socket magic to avoid going through the hard disk.

This fails for data.table, so apparently it does not work for every object type.

Please expand on the "for data.table …" statement. It is important to support or explain it so that others do not draw the wrong conclusion. Do you just need library(data.table) on the workers? (This is explained in )
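
To put a rough number on the comment about clusterExport(cl, "x") sending the object to each worker in turn, one can compare the serialized size of the object with the number of workers. This is a back-of-the-envelope sketch only: `rnorm(1e6)` is a stand-in for the large object, and protocol overhead is ignored.

```r
x <- rnorm(1e6)                           # stand-in for the large object
n_workers <- 64                           # as in the question

## Bytes produced when x is serialized once
payload <- length(serialize(x, connection = NULL))

## Rough totals over the slow link, ignoring protocol overhead
naive_bytes  <- payload * n_workers       # clusterExport(cl, "x")
cached_bytes <- payload                   # clusterExport(cl[1], "x")

cat(sprintf("one copy: %.1f MB, %d copies: %.1f MB\n",
            payload / 1e6, n_workers, naive_bytes / 1e6))
```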