RHadoop mapreduce for multiple input files

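I am building a mapreduce program in R that uses a genetic algorithm to extract the relevant features from a set of features in a dataset. I need to pass many files as input to the mapreduce job. The code below is my mapreduce program, but it only works for one input file (data.csv).

I put the files in a folder on HDFS:

hadoop fs -copyFromLocal /home/rania/Downloads/matrices/*.csv /user/rania/genetic/data/

This is the map function: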

library(rmr2)   # keyval(), mapreduce()
library(caret)  # gafs(), gafsControl(), rfGA

mon.map <- function(., data) {
  data <- read.csv("/home/rania/Downloads/dataset.csv", header = TRUE, sep = ";")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)

  ga_ctrl <- gafsControl(functions = rfGA,  # assess fitness with random forest
                         method = "cv")     # 10-fold cross-validation
  set.seed(10)
  lev <- c("1", "0")
  rf_ga3 <- gafs(x = data, y = y,
                 iters = 10,    # 10 generations of the algorithm
                 popSize = 4,   # population size for each generation
                 levels = lev,
                 gafsControl = ga_ctrl)
  keyval(rf_ga3$ga$final, data[names(data) %in% rf_ga3$ga$final])
}
I tried to change the map function so that it works with many files, but it failed:

library(magrittr)  # %>%

mon.map <- function(., data) {
  data <- list.files(path = "/home/rania/Downloads/matrices/", full.names = TRUE, pattern = "\\.csv") %>%
    lapply(read.csv, header = TRUE, sep = ",")
  y <- c(1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1)
  for (i in 1:4) {
    ga_ctrl <- gafsControl(functions = rfGA,  # assess fitness with random forest
                           method = "cv")     # 10-fold cross-validation
    set.seed(10)
    lev <- c("1", "0")
    rf_ga3 <- gafs(x = data[[i]], y = y,
                   iters = 10,    # 10 generations of the algorithm
                   popSize = 4,   # population size for each generation
                   levels = lev,
                   gafsControl = ga_ctrl)
  }
  keyval(rf_ga3$ga$final, do.call(cbind, Map(`[`, data, c(rf_ga3$ga$final))))
}
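Two things stand out in the attempt above: rf_ga3 is overwritten on every pass through the for loop, so only the last file's result reaches keyval(), and the 1:4 bound is hard-coded. The map function also reads local files instead of using its data argument. A minimal base-R sketch of keeping one result per file follows. It makes no use of Hadoop: the CSV files are written to a temporary directory, and select_features() is a hypothetical stand-in for the gafs() call (it simply keeps columns with non-zero variance), so every name here is illustrative only.

```r
# Create a few small CSV files in a temporary directory (illustrative data).
csv_dir <- file.path(tempdir(), "matrices_demo")
dir.create(csv_dir, showWarnings = FALSE)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(a = rnorm(5), b = rep(1, 5), c = rnorm(5))
  write.csv(df, file.path(csv_dir, paste0("m", i, ".csv")), row.names = FALSE)
}

# list.files() expects a regular expression; glob2rx() converts a glob such as "*.csv".
files <- list.files(csv_dir, pattern = glob2rx("*.csv"), full.names = TRUE)
tables <- lapply(files, read.csv, header = TRUE, sep = ",")
names(tables) <- basename(files)

# Hypothetical stand-in for gafs(): keep the columns whose variance is non-zero.
select_features <- function(x) {
  names(x)[vapply(x, function(col) var(col) > 0, logical(1))]
}

# One result per file, instead of overwriting a single variable in a loop.
selected <- lapply(tables, select_features)
reduced  <- Map(function(x, keep) x[, keep, drop = FALSE], tables, selected)
```

Iterating over the list itself (lapply or seq_along) rather than 1:4 means the code keeps working when the number of files changes.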

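I think "*.csv" should be used in the pattern argument. Hadoop handles file path patterns easily.

This is the error: Error in if (nrow(x) != length(y)) stop("there should be the same number of samples in x and y") : argument is of length zero. But when I check the number of rows of each data[[i]] and the length of y, they are the same. What could be the problem? @user238607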
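The wording of that error hints at its mechanics. In R, nrow() returns NULL for any object without a dim attribute (a plain vector or a list), and comparing NULL with a number yields a zero-length logical, which if() rejects with "argument is of length zero". So the check can fail even when the counts match, because x is not a data frame or matrix at that point. The base-R sketch below reproduces this on made-up data. Whether this is what happens inside the job above is an assumption; single-column subsetting without drop = FALSE is just one plausible way to lose the dimensions.

```r
df <- data.frame(f1 = c(0.2, 0.5, 0.9), f2 = c(1, 0, 1))  # made-up data
y  <- c(1, 0, 1)

# A data frame has rows, so the comparison is a single TRUE/FALSE.
ok_check <- nrow(df) != length(y)

# Selecting one column with [ , j] drops to a plain vector: nrow() is NULL.
x_vec <- df[, "f1"]
bad_check <- nrow(x_vec) != length(y)   # zero-length logical

# if() on a zero-length condition raises an error like the one in the question.
msg <- tryCatch(
  {
    if (bad_check) stop("unreachable")
    "no error"
  },
  error = function(e) conditionMessage(e)
)

# Keeping the data frame structure avoids the problem.
x_df <- df[, "f1", drop = FALSE]
```

Printing class(x) and dim(x) just before the failing call is a quick way to confirm or rule out this cause.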
mon.reduce <- function(k, v) {
  keyval(k, v)  # identity reduce: pass each key/value pair through unchanged
}

hdfs.root <- 'genetic'
hdfs.data <- file.path(hdfs.root, 'data')  # input folder on HDFS
hdfs.out <- file.path(hdfs.root, 'out')    # output folder on HDFS
csv.format <- make.output.format("csv")

genetic <- function(input, output) {
  mapreduce(input = input, output = output,
            input.format = "csv", output.format = csv.format,
            map = mon.map, reduce = mon.reduce)
}

out <- genetic(hdfs.data, hdfs.out)
results <- from.dfs(out, format = "csv")
print(results)
hdfs.cat("/genetic/out/part-00000")