Cross-validation of a pipeline in Apache Spark
Cross-validation from outside the pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val naiveBayes = new NaiveBayes()
val indexer = new StringIndexer()

val pipeLine = new Pipeline().setStages(Array(indexer, naiveBayes))

val paramGrid = new ParamGridBuilder()
  .addGrid(naiveBayes.smoothing, Array(1.0, 0.1, 0.3, 0.5))
  .build()

val crossValidator = new CrossValidator()
  .setEstimator(pipeLine)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(2)
  .setEstimatorParamMaps(paramGrid)

val crossValidatorModel = crossValidator.fit(trainData)
val predictions = crossValidatorModel.transform(testData)
Cross-validation inside the pipeline:
val naiveBayes = new NaiveBayes()
val indexer = new StringIndexer()

// param grid over the smoothing parameter
val paramGrid = new ParamGridBuilder()
  .addGrid(naiveBayes.smoothing, Array(0.35, 0.1, 0.2, 0.3, 0.5))
  .build()

// validator wraps only the naive Bayes stage
val crossValidator = new CrossValidator()
  .setEstimator(naiveBayes)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setNumFolds(2)
  .setEstimatorParamMaps(paramGrid)

// pipeline executes the compound transformation, with the validator as a stage
val pipeLine = new Pipeline().setStages(Array(indexer, crossValidator))

// fit the pipeline, then transform the test data
val pipeLineModel = pipeLine.fit(trainData)
val predictions = pipeLineModel.transform(testData)
So I would like to know which approach is better, and what its pros and cons are.
I got the same results and accuracy with both approaches; the second one was even slightly faster than the first. According to a training course I attended, this should be the best practice:
cv = CrossValidator(estimator=lr,..)
pipelineModel = Pipeline(stages=[idx, assembler, cv])
cv_model = pipelineModel.fit(train)
This way, the upstream pipeline stages are fit only once, instead of being refit repeatedly for every entry in the param grid, which makes the run faster.
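The speed difference comes down to how often the upstream stages are refit: CrossValidator fits its estimator numFolds × |paramGrid| times, then refits the best model once on the full training data. A back-of-envelope sketch in plain Scala (no Spark needed; the fold count and grid size are taken from the first snippet and are for illustration only):

```scala
// Toy fit-count arithmetic for the two arrangements; not Spark API.
object FitCount {
  val k = 2 // numFolds, as in the question
  val p = 4 // parameter combinations in the smoothing grid

  // CV outside: the CrossValidator refits the *whole* pipeline for every
  // (fold, param) pair, plus one final refit on the full training data,
  // so the indexer is fit k * p + 1 times.
  val indexerFitsOutside: Int = k * p + 1

  // CV inside: Pipeline.fit runs the indexer once; only the naive Bayes
  // stage is refit inside the CrossValidator.
  val indexerFitsInside: Int = 1

  def main(args: Array[String]): Unit = {
    println(s"indexer fits with CV outside: $indexerFitsOutside") // 9
    println(s"indexer fits with CV inside:  $indexerFitsInside")  // 1
  }
}
```

With two folds and four grid entries, that is nine fits of the indexer versus one; the gap grows linearly with both the fold count and the grid size.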
Hope this helps.