
Cross-validation metrics with PySpark


When we do k-fold cross-validation, we are testing how well a model performs when predicting data it has never seen.

If I split my dataset into 90% training and 10% test and analyse the model's performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict.
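(For reference, that single split would look something like this, assuming a DataFrame called df:)

# One random 90/10 split; the held-out 10% is a single sample and may
# happen to contain unusually easy or unusually hard points.
train, test = df.randomSplit([0.9, 0.1], seed=42)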

By doing 10-fold cross-validation, I can be sure that every point is used for training at least once. Since (in this case) the model will be tested 10 times, we can analyse those test metrics, which gives us a better understanding of how the model performs when classifying new data.

Cross-validation is a method for optimising an algorithm's hyperparameters, and its purpose is to check the model.

By doing this:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

lr = LogisticRegression(maxIter=10, tol=1E-4)
ovr = OneVsRest(classifier=lr)
pipeline = Pipeline(stages=[... , ovr])
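# paramGrid is never shown in the question; this is a hypothetical,
# minimal grid over the classifier's regularisation parameter.
from pyspark.ml.tuning import ParamGridBuilder
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1]) \
    .build()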

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(df)
As far as I understand, I am able to obtain a model with the best set of parameters among those defined in paramGrid. I see the value of this hyperparameter tuning, but what I want is to analyse the model's performance, not just to obtain the best model.

The question is (for 10-fold cross-validation in this case):

Is it possible to use the CrossValidator to extract metrics (f1, precision, recall, etc.) for each of the 10 tests (or the average over the 10 tests for each metric)? That is, can the CrossValidator be used for model checking rather than model selection?
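(For what it's worth, the stock CrossValidatorModel does expose the evaluator's metric averaged over the folds, one value per parameter map, but nothing per fold or per class:)

# avgMetrics has one entry per item in paramGrid: the evaluator's metric
# (f1 is MulticlassClassificationEvaluator's default), averaged over the folds.
print(cvModel.avgMetrics)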

Thanks in advance.


UPDATE
As mentioned in the comments, a similar question has been asked before. The first suggestion there was to set collectSubModels to True before fitting, but this threw an error saying the keyword does not exist (honestly, I did not spend much time trying to find out why).
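For reference, on Spark versions where PySpark does support it (mine apparently does not, hence the error), the flag would be used like this:

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10,
                          collectSubModels=True)
cvModel = crossval.fit(df)

# cvModel.subModels[fold][paramIndex] is the model trained on that fold
# with that parameter combination, ready to be evaluated by hand.
firstFoldFirstParams = cvModel.subModels[0][0]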

In their answer, a user provided a workaround for printing intermediate training results. With the method they provided, the intermediate values of the evaluation metric can be printed. Since I wanted to extract the intermediate results for precision, recall, f1 and the confusion matrix, I made some modifications to their implementation:

import collections

import numpy as np
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.functions import rand

# One record per model per fold: the parameter map used and the full
# MulticlassMetrics object computed on that fold's validation set.
TestResult = collections.namedtuple("TestResult", ["params", "metrics"])

class CrossValidatorVerbose(CrossValidator):

    def _fit(self, dataset):
        folds = []
        est = self.getOrDefault(self.estimator)
        epm = self.getOrDefault(self.estimatorParamMaps)
        numModels = len(epm)

        eva = self.getOrDefault(self.evaluator)
        metricName = eva.getMetricName()
        nFolds = self.getOrDefault(self.numFolds)
        seed = self.getOrDefault(self.seed)
        h = 1.0 / nFolds

        # Assign every row a random number in [0, 1) and slice that range
        # into nFolds equal buckets, one bucket per validation fold.
        randCol = self.uid + "_rand"
        df = dataset.select("*", rand(seed).alias(randCol))
        metrics = [0.0] * numModels

        for i in range(nFolds):
            folds.append([])
            foldNum = i + 1
            print("Comparing models on fold %d" % foldNum)

            validateLB = i * h
            validateUB = (i + 1) * h
            condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
            validation = df.filter(condition)
            train = df.filter(~condition)

            for j in range(numModels):
                paramMap = epm[j]
                model = est.fit(train, paramMap)
                # TODO: duplicate evaluator to take extra params from input
                prediction = model.transform(validation, paramMap)
                metric = eva.evaluate(prediction)
                metrics[j] += metric

                avgSoFar = metrics[j] / foldNum
                print("params: %s\t%s: %f\tavg: %f" % (
                    {param.name: val for (param, val) in paramMap.items()},
                    metricName, metric, avgSoFar))

                # Keep the full multiclass metrics (per-class precision/recall/f1,
                # confusion matrix, ...) for this fold and parameter combination.
                predictionLabels = prediction.select("prediction", "label")
                allMetrics = MulticlassMetrics(predictionLabels.rdd)
                folds[i].append(TestResult(paramMap.items(), allMetrics))

        if eva.isLargerBetter():
            bestIndex = np.argmax(metrics)
        else:
            bestIndex = np.argmin(metrics)

        # Refit the best parameter combination on the full dataset.
        bestParams = epm[bestIndex]
        bestModel = est.fit(dataset, bestParams)
        avgMetrics = [m / nFolds for m in metrics]
        bestAvg = avgMetrics[bestIndex]
        print("Best model:\nparams: %s\t%s: %f" % (
            {param.name: val for (param, val) in bestParams.items()},
            metricName, bestAvg))

        # Unlike the stock CrossValidator, also return the per-fold results.
        return self._copyValues(CrossValidatorModel(bestModel, avgMetrics)), folds
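Since the overridden _fit returns a tuple, fit now does as well; reusing the pipeline, paramGrid and evaluator from above, it is called like this:

crossval = CrossValidatorVerbose(estimator=pipeline,
                                 estimatorParamMaps=paramGrid,
                                 evaluator=MulticlassClassificationEvaluator(),
                                 numFolds=10)

# Unlike the stock CrossValidator, fit() returns both the best model and
# the per-fold TestResult lists collected during training.
cvModel, folds = crossval.fit(df)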
To print the metrics of a specific fold (here the first fold, with the first set of hyperparameters), I use the following helper:

def printMetrics(metrics, df):
    # Per-class stats for every label present in the dataset
    labels = df.rdd.map(lambda lp: lp.label).distinct().collect()
    for label in sorted(labels):
        print("Class %s precision = %s" % (label, metrics.precision(label)))
        print("Class %s recall = %s" % (label, metrics.recall(label)))
        print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))
        print("")

    # Weighted stats
    print("Weighted recall = %s" % metrics.weightedRecall)
    print("Weighted precision = %s" % metrics.weightedPrecision)
    print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
    print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
    print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)
    print("Accuracy = %s" % metrics.accuracy)

printMetrics(folds[0][0].metrics, df)

This will print something like:

Class 0.0 precision = 0.809523809524
Class 0.0 recall = 0.772727272727
Class 0.0 F1 Measure = 0.790697674419

Class 1.0 precision = 0.857142857143
Class 1.0 recall = 0.818181818182
Class 1.0 F1 Measure = 0.837209302326

Class 2.0 precision = 0.875
Class 2.0 recall = 0.875
Class 2.0 F1 Measure = 0.875

...

Weighted recall = 0.808333333333
Weighted precision = 0.812411616162
Weighted F(1) Score = 0.808461689698
Weighted F(0.5) Score = 0.810428077222
Weighted false positive rate = 0.026335560185
Accuracy = 0.808333333333
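Since the confusion matrix was one of the intermediate results I wanted, note that it can be read off the same MulticlassMetrics object stored in each TestResult (the helper above does not print it):

# Rows correspond to actual labels, columns to predicted labels, both
# ordered by ascending label value; toArray() yields a NumPy ndarray.
print(folds[0][0].metrics.confusionMatrix().toArray())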
You can also check the possible duplicate; one of the answers there may help you.