Python h2o ensemble throws error: "Base model does not keep cross-validation predictions"

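I'm trying to build an ensemble model in H2O from a large number of GLM, GBM, and deep learning models.

Here's what I've done so far.

Import the relevant libraries: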

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
The data is available for download. Import it:

Split into training/test sets:

airlines_80,airlines_20 = airlines.split_frame(ratios=[.8], destination_frames=["airlines_80.hex", "airlines_20.hex"])
Define the variables (predict y as a function of all columns in x):

Set common properties:

folds=5
assignment_type="Modulo"
search_criteria={'strategy': 'RandomDiscrete', 'max_models': 5, 'seed': 1}
Create a variety of models using H2O's grid search:

# GLM
glm_params = {"alpha": [0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.],
              "lambda": [0, 1e-7, 1e-5, 1e-3, 1e-1]}

glm_grid = H2OGridSearch(model=H2OGeneralizedLinearEstimator(fold_assignment=assignment_type, nfolds=folds),
                         grid_id='glm_grid',
                         hyper_params=glm_params,
                         search_criteria=search_criteria)
glm_grid.train(x=x,
               y=y,
               training_frame=airlines_80,
               validation_frame=airlines_20)

# GBM
gbm_params = {'learn_rate': [0.01, 0.03],
              'max_depth': [3, 4, 5, 6, 9],
              'sample_rate': [0.7, 0.8, 0.9, 1],
              'col_sample_rate': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]}

gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator(fold_assignment=assignment_type, nfolds=folds),
                         grid_id='gbm_grid',
                         hyper_params=gbm_params,
                         search_criteria=search_criteria)
gbm_grid.train(x=x,
               y=y,
               training_frame=airlines_80,
               validation_frame=airlines_20)

# Deep learning
dl_params = {'activation': ['rectifier', 'rectifier_with_dropout'],
             'hidden': [[10,10], [20,15], [50,50,50]],
             'l1': [0, 1e-3, 1e-5],
             'l2': [0, 1e-3, 1e-5]}

dl_grid = H2OGridSearch(model=H2ODeepLearningEstimator(fold_assignment=assignment_type, nfolds=folds),
                        grid_id='dl_grid',
                        hyper_params=dl_params,
                        search_criteria=search_criteria)

dl_grid.train(x=x,
              y=y,
              training_frame=airlines_80,
              validation_frame=airlines_20)
Get a list of all model IDs:

all_model_ids = glm_grid.model_ids + gbm_grid.model_ids + dl_grid.model_ids
This is where I try to create the ensemble:

ensemble = H2OStackedEnsembleEstimator(base_models=all_model_ids)
ensemble.train(x=x, y=y, training_frame=airlines_80, validation_frame=airlines_20)
...which raises the following error:

stackedensemble Model Build progress: | (failed)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-26-bc7b6094816f> in <module>()
      1 ensemble = H2OStackedEnsembleEstimator(base_models=all_model_ids)
----> 2 ensemble.train(x=x, y=y, training_frame=airlines_80, validation_frame=airlines_20)

/anaconda3/lib/python3.6/site-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
    235             return
    236 
--> 237         model.poll(verbose_model_scoring_history=verbose)
    238         model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
    239         self._resolve_model(model.dest_key, model_json)

/anaconda3/lib/python3.6/site-packages/h2o/job.py in poll(self, verbose_model_scoring_history)
     75             if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
     76                 raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
---> 77                                        "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
     78             else:
     79                 raise EnvironmentError("Job with key %s failed with an exception: %s" % (self.job_key, self.exception))

OSError: Job with key $03017f00000132d4ffffffff$_a2359a38ec8d31316aee91398f0249f8 failed with an exception: water.exceptions.H2OIllegalArgumentException: Base model does not keep cross-validation predictions: 5
stacktrace: 
water.exceptions.H2OIllegalArgumentException: Base model does not keep cross-validation predictions: 5
    at hex.StackedEnsembleModel.checkAndInheritModelProperties(StackedEnsembleModel.java:382)
    at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:234)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:218)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1395)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Can you see what I'm doing wrong?

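It looks like you are missing the argument keep_cross_validation_predictions=True in each of your models. For example, you would do the following for the GLM, and then something similar for the other models you want to stack: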

glm_grid = H2OGridSearch(model=H2OGeneralizedLinearEstimator(fold_assignment=assignment_type, nfolds=folds,
    keep_cross_validation_predictions=True),
                                 grid_id='glm_grid',
                                 hyper_params=glm_params,
                                 search_criteria=search_criteria)
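
Since the same flag has to be set on the GBM and deep learning grids as well, one way to avoid forgetting it is to keep the shared cross-validation settings in a single dictionary and pass it to every estimator. The following is only a sketch under assumptions: COMMON_CV_ARGS and build_base_grids are hypothetical names that do not appear in the question, and the values mirror the folds and assignment_type variables defined there.

```python
# Shared cross-validation settings for every base model in the stack.
# The values mirror `folds` and `assignment_type` from the question, plus
# the flag that the Stacked Ensemble complained about.
COMMON_CV_ARGS = {
    "nfolds": 5,
    "fold_assignment": "Modulo",
    "keep_cross_validation_predictions": True,
}


def build_base_grids(glm_params, gbm_params, dl_params, search_criteria):
    """Return untrained GLM, GBM and deep learning grids sharing COMMON_CV_ARGS.

    Hypothetical helper: it only groups the three H2OGridSearch
    constructions from the question in one place.
    """
    # Imported inside the function so the settings above can be inspected
    # on a machine where h2o is not installed.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator
    from h2o.grid.grid_search import H2OGridSearch

    specs = [
        ("glm_grid", H2OGeneralizedLinearEstimator, glm_params),
        ("gbm_grid", H2OGradientBoostingEstimator, gbm_params),
        ("dl_grid", H2ODeepLearningEstimator, dl_params),
    ]
    return {
        grid_id: H2OGridSearch(model=estimator(**COMMON_CV_ARGS),
                               grid_id=grid_id,
                               hyper_params=params,
                               search_criteria=search_criteria)
        for grid_id, estimator, params in specs
    }
```

Each returned grid would then be trained exactly as in the question, for example grids["gbm_grid"].train(x=x, y=y, training_frame=airlines_80, validation_frame=airlines_20), before its model IDs are passed to H2OStackedEnsembleEstimator.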

This shouldn't be happening, since by default we keep the CV predictions (and you didn't turn that option off). What version of H2O are you using? If it's not the latest stable release, can you upgrade and try again? I recall that we turned off saving CV predictions by default in grid search for a short while, but then realized the problem and fixed it, so I'm hoping an upgrade will resolve the issue.

Thanks @Erin. It's an honor: I just watched one of your talks on YouTube. I'm using H2O version 3.22.1.2, which is 15 days old according to the output of h2o.init(). I'll see if a newer version is available. Also, I took a look at the models in Flow and saw that the cross-validation models are there (e.g., dl_grid_model_1, dl_grid_model_cv_1, dl_grid_model_1_cv_2, etc.).

@Erin: FYI, I updated H2O to the latest and greatest, currently 3.22.1.3, and ran into the same error.

I'll try to figure out what's going on here.

Lauren solved it in her answer. I forgot that keep_cross_validation_predictions is off by default in the regular H2O algorithms (I was thinking of H2O AutoML, where it is on by default).
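
Because the flag is off by default in the regular algorithms, it can help to check the trained base models before building the ensemble rather than waiting for the exception. The sketch below is a hedged illustration, not part of the original thread: models_missing_cv_predictions is a hypothetical helper, and it assumes each model object exposes model_id and an actual_params dictionary that contains a keep_cross_validation_predictions entry.

```python
def models_missing_cv_predictions(models):
    """Return the IDs of models that were trained without kept CV predictions.

    Hypothetical helper. It relies only on two attributes of each model
    object: `model_id` and `actual_params` (assumed to be a dict that
    includes the key "keep_cross_validation_predictions").
    """
    missing = []
    for model in models:
        # Treat a missing attribute or a missing key as "not kept".
        params = getattr(model, "actual_params", None) or {}
        if not params.get("keep_cross_validation_predictions", False):
            missing.append(model.model_id)
    return missing


# Usage against a running H2O cluster (not executed here):
#   import h2o
#   base_models = [h2o.get_model(model_id) for model_id in all_model_ids]
#   print(models_missing_cv_predictions(base_models))
```

An empty list would suggest that every base model kept its cross-validation predictions. Any IDs it returns point to models that would need retraining with keep_cross_validation_predictions=True before stacking.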