R: h2o predict sometimes fails when the response variable is not in the test set


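When predicting on a test set that does not contain the response variable, h2o fails in a variety of ways if one-hot encoding was used for a factor variable during training, whether it was specified implicitly (as when training a GLM) or explicitly in other methods.

The error occurs with R 3.4.0 and h2o 3.12.0.1. We also tested with h2o 3.10.3.3.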

library(h2o)
localH2O = h2o.init()

# Load the prostate data shipped with the h2o package and add a factor column
# with one distinct level per row (Q1 ... Q380)
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = read.csv(prostatePath)
prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:380),1))))

# Convert to an H2OFrame and add a constant column used as the offset below
prostate.hex<-as.h2o(prostate.hex)
prostate.hex$weight<-1

# Split into train and test frames
prostate_train<-prostate.hex[1:300,]
prostate_test<-prostate.hex[301:380,]
prostate_test<-prostate_test[,-3] #delete response variable from test data

# Case 1: GLM with an offset column
model<-h2o.glm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train,offset_column="weight")
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)

# Case 2: GLM without an offset column
model<-h2o.glm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train)
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)

# Case 3: GBM with explicit one-hot encoding
model<-h2o.gbm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train,categorical_encoding = "OneHotExplicit")
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)
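The GBM example fails with the error shown in the stack trace at the end of this post (even though the only column missing from the test data is the response variable).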

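The error appears to be specific to factor variables combined with explicit one-hot encoding. It can be worked around by adding a "fake" response column to the test dataset (we have tested this and, as we expected, the values in that column have no effect on the predictions), but this is clearly not ideal.

If there are 5 or more factor levels, the error persists even when all factor levels are present in both the training and test sets: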
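The "fake" response column workaround can be sketched in base R roughly as follows. The helper `add_dummy_response` is a hypothetical name introduced only for illustration (it is not part of h2o), and it assumes the column is added to a plain data.frame before conversion with `as.h2o()`.

```r
# Hypothetical helper (not part of h2o): add a placeholder response column
# to a plain data.frame if it is missing, before converting with as.h2o().
add_dummy_response <- function(df, response, value = 0) {
  if (!(response %in% names(df))) {
    df[[response]] <- value  # constant placeholder; reported not to affect predictions
  }
  df
}

# Small stand-in for the test data, without the response column AGE.
test_df <- data.frame(our_factor = factor(c("Q1", "Q2", "Q3")))
test_df <- add_dummy_response(test_df, "AGE")
names(test_df)
```

On the H2OFrame from the example, the same idea would presumably be a direct assignment such as `prostate_test$AGE <- 0` before calling `predict`, mirroring the `prostate.hex$weight<-1` assignment used earlier.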

prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:5),76))))
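As a quick sanity check of the claim that every level appears in both splits, the factor can be inspected in base R before it is sent to h2o. This is only a sketch; the variable names `train_levels` and `test_levels` are introduced here for illustration, and the row ranges are the same 1:300 and 301:380 split used in the example.

```r
# Rebuild the 5-level factor used above and compare the levels seen in
# the training rows (1:300) with those seen in the test rows (301:380).
our_factor <- as.factor(paste0("Q", rep(1:5, 76)))

train_levels <- unique(as.character(our_factor[1:300]))
test_levels  <- unique(as.character(our_factor[301:380]))

# Levels present in one split but not the other; both are empty here.
setdiff(test_levels, train_levels)
setdiff(train_levels, test_levels)
```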

The same problem has also been reported for Python, and there is an h2o bug tracker entry for it. The error output from the GBM example:
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1028)
    at hex.Model.adaptTestForTrain(Model.java:854)
    at hex.Model.score(Model.java:1072)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set