Spark, ML, Tuning, CrossValidator: access the metrics

To build a NaiveBayes multiclass classifier, I am using a CrossValidator to select the best parameters in my pipeline:

val cv = new CrossValidator()
        .setEstimator(pipeline)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(new MulticlassClassificationEvaluator)
        .setNumFolds(10)

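val cvModel = cv.fit(trainingSet)

The pipeline contains the usual transformers and estimators in the following order: Tokenizer, StopWordsRemover, HashingTF, IDF and finally NaiveBayes.

Is it possible to access the metrics computed for the best model?

Ideally, I would like to access the metrics of all the models to see how changing the parameters changes the quality of the classification. For the moment, though, the best model alone is good enough.

FYI, I am using Spark 1.6.0.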

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, stopWordsFilter, tf, idf, word2Vec, featureVectorAssembler, categoryIndexerModel, classifier, categoryReverseIndexer))

...

val paramGrid = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10, 100))
  .addGrid(idf.minDocFreq, Array(1, 10))
  .addGrid(word2Vec.vectorSize, Array(200, 300))
  .addGrid(classifier.maxDepth, Array(3, 5))
  .build()

paramGrid.size // 16 entries

...

// Average metric (across folds) for each ParamGrid entry, in the same order as paramGrid
val avgMetricsParamGrid = crossValidatorModel.avgMetrics

// Combine with paramGrid to see how they affect the overall metrics
val combined = paramGrid.zip(avgMetricsParamGrid)

...

val bestModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]

// Explain params for each stage
val bestHashingTFNumFeatures = bestModel.stages(2).asInstanceOf[HashingTF].explainParams
val bestIDFMinDocFrequency = bestModel.stages(3).asInstanceOf[IDFModel].explainParams
val bestWord2VecVectorSize = bestModel.stages(4).asInstanceOf[Word2VecModel].explainParams
val bestDecisionTreeDepth = bestModel.stages(7).asInstanceOf[DecisionTreeClassificationModel].explainParams
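
Once the grid has been zipped with avgMetrics as above, it can help to sort the pairs so the best entry comes first. The following is only a minimal sketch in plain Scala: `MetricRanking` and `rankByMetric` are hypothetical names, not part of Spark, and the `largerIsBetter` flag is an assumption that should match whatever the evaluator reports through `isLargerBetter`.

```scala
// Hypothetical helper, not part of Spark: pairs each grid entry with its
// average metric and sorts the pairs from best to worst.
object MetricRanking {
  def rankByMetric[A](
      grid: Seq[A],
      avgMetrics: Seq[Double],
      largerIsBetter: Boolean = true): Seq[(A, Double)] = {
    // avgMetrics is expected to hold exactly one value per grid entry.
    require(
      grid.length == avgMetrics.length,
      s"got ${grid.length} grid entries but ${avgMetrics.length} metrics")
    val paired = grid.zip(avgMetrics)
    // sortBy is ascending, so negate the metric when larger values are better.
    if (largerIsBetter) paired.sortBy { case (_, metric) => -metric }
    else paired.sortBy { case (_, metric) => metric }
  }
}

// Possible usage with the values from the code above (assumes larger is
// better, as with the default "f1" metric of MulticlassClassificationEvaluator):
//   MetricRanking
//     .rankByMetric(paramGrid.toSeq, crossValidatorModel.avgMetrics.toSeq)
//     .foreach { case (params, metric) => println(s"$metric\n$params") }
```

The length check is there because the pairing relies on both sequences being in the same order, which is the assumption the code above already makes.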

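This works in pyspark 2.2.0, but I really don't like it, because it assumes internal knowledge of how CrossValidator works. The way the metrics array is built could change in the next release so that the entries come out in a different order; your results would then be silently wrong, and you would not notice because your code would still run. I would like each model's params to be returned together with its metrics. I would also like to see summary statistics, not just the average. How useful is an average without a standard deviation?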
 cvModel.avgMetrics
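
On the point about averages without a standard deviation: avgMetrics holds only the mean per grid entry, so the per-fold values would have to be collected separately, for example by running the folds by hand. Assuming those per-fold metrics are available as a plain sequence of doubles, a minimal sketch of the summary could look like this. `FoldSummary`, `Summary` and `summarize` are hypothetical names, not Spark APIs.

```scala
// Hypothetical helper, not part of Spark: summarizes the per-fold metrics
// of a single grid entry.
object FoldSummary {
  final case class Summary(mean: Double, stdDev: Double, min: Double, max: Double)

  def summarize(foldMetrics: Seq[Double]): Summary = {
    require(foldMetrics.nonEmpty, "need at least one fold metric")
    val n = foldMetrics.length
    val mean = foldMetrics.sum / n
    // Sample standard deviation (n - 1 denominator); defined as 0.0 for a single fold.
    val stdDev =
      if (n < 2) 0.0
      else math.sqrt(foldMetrics.map(m => (m - mean) * (m - mean)).sum / (n - 1))
    Summary(mean, stdDev, foldMetrics.min, foldMetrics.max)
  }
}

// Possible usage, with made-up fold values for one grid entry:
//   val s = FoldSummary.summarize(Seq(0.8, 0.9, 1.0))
//   println(s"mean=${s.mean} stdDev=${s.stdDev} min=${s.min} max=${s.max}")
```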