How to preprocess new data for prediction with the mlr package

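If I want to use the mlr package to make predictions on new data, how do I preprocess the new data so that it reuses the information from preprocessing the original data? For example, if I merge small factor levels and the level frequencies in the new data set differ from those in the first data set, the resulting factor levels may differ and prediction fails. Note: I am assuming here that the new data is not yet available when the model is trained. This is not about test data, but about predicting on new data. So how should new data be preprocessed in mlr? Here is an example in which I create a new task to preprocess the new data set, which leads to an error: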

library(mlr)
a <- data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0, where strings stay character by default
# most frequent x1 level is "a"
aTask <- makeClassifTask(data = a, target = "y", positive="1")
aTask <- mergeSmallFactorLevels(aTask, cols=c("x1"), min.perc=0.1)
# combines "b" and "c" into factor ".merged"
getTaskData(aTask)

aLearner <- makeLearner("classif.rpart", predict.type = "prob")
model <- train(aLearner, aTask)

b <- data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0, where strings stay character by default
# most frequent x1 level is "b"
# the target here is made up, because at this stage no target variable
# would be available
newdataTask <- makeClassifTask(data = b, target = "y", positive="1")
newdataTask <- mergeSmallFactorLevels(newdataTask, cols="x1", 
                                      min.perc = 0.1)
# combines "a" and "c" into factor ".merged"
getTaskData(newdataTask)

pred <- predict(model, newdataTask)

#Error in model.frame.default(Terms, newdata, na.action = na.action, 
#                              xlev = attr(object,  : 
#Faktor 'x1' hat neue Stufen b (= factor 'x1' has new level b)

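mlr does not provide anything that does this automatically, but you can easily check which factor levels were replaced and rename them accordingly in the new data: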

library(plyr)
to.replace = setdiff(levels(b$x1), levels(getTaskData(aTask)$x1))
b$x1 = mapvalues(b$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))
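The same renaming can also be done without plyr. The following is a minimal base R sketch, not part of the original answer; `train.levels` and `new.x1` are made-up stand-ins for the levels left in the training task and for the new column. It additionally rebuilds the factor with the training level set, so that the levels and their order match what the model saw.

```r
# Levels left after merging on the training data (assumed values for illustration).
train.levels <- c("a", ".merged")

# New column containing levels ("b", "c") that were merged away during training.
new.x1 <- factor(c("a", "b", "b", "c"))

# Levels the trained model has never seen.
to.replace <- setdiff(levels(new.x1), train.levels)

# Giving several levels the same name collapses them into one level.
levels(new.x1)[levels(new.x1) %in% to.replace] <- ".merged"

# Rebuild the factor so the level set and order match the training data.
new.x1 <- factor(new.x1, levels = train.levels)
```

If nothing was merged during training, `train.levels` contains no ".merged" entry and unseen levels end up as NA in the last step, so that case would need separate handling.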
Complete example:

library(mlr)
a = data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0
aTask = makeClassifTask(data = a, target = "y", positive="1")
aTask = mergeSmallFactorLevels(aTask, cols=c("x1"), min.perc=0.1)

aLearner = makeLearner("classif.rpart", predict.type = "prob")
model = train(aLearner, aTask)

b = data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0
library(plyr)
to.replace = setdiff(levels(b$x1), levels(getTaskData(aTask)$x1))
b$x1 = mapvalues(b$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))

newdataTask = makeClassifTask(data = b, target = "y", positive="1")

pred = predict(model, newdataTask)
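For something like this, it is better to fuse the learner and the preprocessing, so that the preprocessing happens transparently and automatically at both training and prediction time. In this case, the complete example looks like this: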

lrn = makeLearner("classif.rpart", predict.type = "prob")
trainfun = function(data, target, args) {
    task = makeClassifTask(data = data, target = target, positive = "1")
    new.task = mergeSmallFactorLevels(task, cols = c("x1"), min.perc = 0.1)
    return(list(data = getTaskData(new.task), control = list(levels(getTaskData(new.task)$x1))))
}
predictfun = function(data, target, args, control) {
    library(plyr)
    to.replace = setdiff(levels(data$x1), control[[1]])
    data$x1 = mapvalues(data$x1, from = to.replace, to = rep(".merged", times = length(to.replace)))
    return(data)
}
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun)

a = data.frame(y=factor(c(1,1,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(10,1,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0
aTask = makeClassifTask(data = a, target = "y", positive="1")
model = train(lrn, aTask)

b = data.frame(y=factor(c(1,0,1,1,1,1,1,1,0,0,1,0)), 
                x1=rep(c("a","b", "c"), times=c(1,10,1)),
                stringsAsFactors=TRUE)  # needed on R >= 4.0
newdataTask = makeClassifTask(data = b, target = "y", positive = "1")
pred = predict(model, newdataTask)
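This is just a proof of concept. You will probably want parameters that specify which features should be processed and what the threshold should be, and you will want to adapt the predictfun code to handle an arbitrary number of processed features.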
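One way such a generalization might look is sketched below in base R only. The helper names (`apply_merge`, `merge_small_levels_train`, `merge_small_levels_predict`) and the layout of the `control` list are hypothetical, and the threshold rule (keep levels whose relative frequency is at least `min.perc`) only approximates what mergeSmallFactorLevels does. The two main functions could be called from a trainfun and predictfun like the ones above.

```r
# Hypothetical helper: map every value outside `keep` to `new.level`.
# The merged level is always part of the level set, even if unused.
apply_merge <- function(x, keep, new.level) {
  x <- as.character(x)
  x[!is.na(x) & !(x %in% keep)] <- new.level
  factor(x, levels = union(keep, new.level))
}

# Training step: decide per column which levels are frequent enough to keep.
merge_small_levels_train <- function(data, cols, min.perc = 0.1,
                                     new.level = ".merged") {
  keep <- list()
  for (col in cols) {
    tab <- table(data[[col]])
    keep[[col]] <- names(tab)[as.vector(tab) / sum(tab) >= min.perc]
    data[[col]] <- apply_merge(data[[col]], keep[[col]], new.level)
  }
  list(data = data, control = list(keep = keep, new.level = new.level))
}

# Prediction step: reuse the levels chosen at training time for every column.
merge_small_levels_predict <- function(data, control) {
  for (col in names(control$keep)) {
    data[[col]] <- apply_merge(data[[col]], control$keep[[col]],
                               control$new.level)
  }
  data
}

# Usage with the x1 column from the example above.
a <- data.frame(x1 = factor(rep(c("a", "b", "c"), times = c(10, 1, 1))))
b <- data.frame(x1 = factor(rep(c("a", "b", "c"), times = c(1, 10, 1))))
fit <- merge_small_levels_train(a, cols = "x1", min.perc = 0.1)
b.new <- merge_small_levels_predict(b, fit$control)
```

Because the kept levels are stored per column in `control`, the prediction step never looks at the frequencies in the new data, which is the property the original question asks for.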