Accessing an estimator in a Spark pipeline

Tags: apache-spark, pipeline

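Similar to a previously asked question, I want to access an estimator inside a pipeline, for example its last element.

The approach described there no longer seems to work in Spark 2.0.1. How does it work now?

Edit: maybe I should explain in more detail. This is my estimator plus the vector assembler: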

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// XGBoost4J-Spark; package name as in the 0.7x releases that ship XGBoostEstimator.
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator

// train, test and gSeed are defined elsewhere.
val numRound = 20
val numWorkers = 4
val xgbBaseParams = Map(
  "max_depth" -> 10,
  "eta" -> 0.1,
  "seed" -> 50,
  "silent" -> 1,
  "objective" -> "binary:logistic"
)

val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
  .setFeaturesCol("features")
  .setLabelCol("label")

// Every column except the label becomes part of the feature vector.
val vectorAssembler = new VectorAssembler()
  .setInputCols(train.columns.filter(!_.contains("label")))
  .setOutputCol("features")

val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers))
  .build()

val simplPipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))

val numberOfFolds = 2
val cv = new CrossValidator()
  .setEstimator(simplPipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(gSeed)

val cvModel = cv.fit(train)
val trainPerformance = cvModel.transform(train)
val testPerformance = cvModel.transform(test)
Now I want to perform custom scoring, e.g. with a cutoff point != 0.5. This is possible if I get hold of the model:

val realModel = cvModel.bestModel.asInstanceOf[XGBoostClassificationModel]
However, this step does not compile. Thanks to your suggestion, I can obtain the model like this:

// bestModel is declared as Model[_]; it is a PipelineModel here because the
// cross-validated estimator was a Pipeline.
val pipelineModel: Option[PipelineModel] = cvModel.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}

// Pick the fitted XGBoost stage out of the pipeline, if there is one.
val realModel: Option[XGBoostClassificationModel] = pipelineModel
  .flatMap(_.stages.collectFirst { case t: XGBoostClassificationModel => t })

// TODO write it nicer
val measureResults = realModel.map { rm =>
  for (
    thresholds <- Array(Array(0.2, 0.8), Array(0.3, 0.7), Array(0.4, 0.6),
      Array(0.6, 0.4), Array(0.7, 0.3), Array(0.8, 0.2))
  ) {
    rm.setThresholds(thresholds)

    val predResult = rm.transform(test)
      .select("label", "probabilities", "prediction")
      .as[LabelledEvaluation]
    // println with two arguments prints a tuple, so format the array explicitly.
    println("cutoff was " + thresholds.mkString(", "))
    calculateEvaluation(R, predResult)
  }
}
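However, rm.transform(test) will fail, because the dataset does not contain the features column produced by the VectorAssembler. That column is only created when the full pipeline runs.

So I decided to create a second pipeline: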

val scoringPipe = new Pipeline()
  .setStages(Array(vectorAssembler, rm))
val predResult = scoringPipe.fit(train).transform(test)

But this seems a bit clumsy. Do you have a better or nicer idea?
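One way to avoid the second pipeline might be to run the already fitted feature stages by hand and then score with the extracted model. The following is only a sketch under assumptions: LogisticRegression stands in for XGBoostEstimator so that the example needs nothing beyond Spark ML, the toy data is invented, and applyFeatureStages is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("score-with-fitted-stages")
  .getOrCreate()
import spark.implicits._

// Toy data, invented for illustration only.
val train = Seq((0.0, 1.0, 0.1), (1.0, 3.0, 0.9), (0.0, 0.5, 0.2), (1.0, 2.5, 0.8))
  .toDF("label", "x1", "x2")
val test = train

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
// Assumption: LogisticRegression is a stand-in for XGBoostEstimator.
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setRegParam(0.01)
val fitted: PipelineModel = new Pipeline().setStages(Array(assembler, lr)).fit(train)

// Hypothetical helper: run every fitted stage except the last one,
// so that the "features" column exists before the classifier is applied.
def applyFeatureStages(model: PipelineModel, data: DataFrame): DataFrame =
  model.stages.init.foldLeft(data)((df, stage) => stage.transform(df))

val prepared = applyFeatureStages(fitted, test)
val classifier = fitted.stages.last.asInstanceOf[LogisticRegressionModel]

// Score with a custom cutoff; nothing is refitted.
val scored = classifier.setThreshold(0.3).transform(prepared)
scored.select("label", "probability", "prediction").show()
```

Because only transform is called, the assembler and the classifier keep the state they had after the original fit, which the scoringPipe.fit(train) variant does not make obvious.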

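Nothing has changed in Spark 2.0.0; the same approach still works. Given a fitted pipeline model (model below), the model of the last stage is: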

val logRegModel = model.stages.last
  .asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
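The cast above throws a ClassCastException if the last stage is of a different type. A slightly more defensive variant could look a stage up by type and return an Option instead. This is a sketch only: findStage is a hypothetical helper, not a Spark API, and the Tokenizer/HashingTF pipeline is invented to keep the example small.

```scala
import scala.reflect.ClassTag
import org.apache.spark.ml.{Pipeline, PipelineModel, Transformer}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("find-stage")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: the first fitted stage of type T, or None.
// The ClassTag makes the type test in the pattern match checkable at runtime.
def findStage[T <: Transformer : ClassTag](model: PipelineModel): Option[T] =
  model.stages.collectFirst { case stage: T => stage }

// Toy data and stages, invented for illustration only.
val docs = Seq((0L, "spark pipeline stages"), (1L, "access the last stage"))
  .toDF("id", "text")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val model: PipelineModel = new Pipeline().setStages(Array(tokenizer, hashingTF)).fit(docs)

// Present in the pipeline, so this is Some(...).
val tf: Option[HashingTF] = findStage[HashingTF](model)
// Not present in the pipeline, so this is None instead of an exception.
val missing: Option[LogisticRegressionModel] = findStage[LogisticRegressionModel](model)
```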

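I believe you are looking for pipeline.getStages(), which returns all stages as an array. You can then access any stage you want. See the linked page for more information.

The difference from the possible duplicate is that I want to use the pipeline inside a cross-validation, i.e. the estimator is nested twice. cvModel.bestModel.getStages does not work. So how do I get the pipeline out of the cross-validator?

Then it is a duplicate of a different question.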
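For the nested case raised in the comments, the two levels can be unwrapped one after the other: CrossValidatorModel.bestModel is declared as Model[_], so it has to be narrowed to PipelineModel first, and a fitted PipelineModel exposes its stages through the stages field (getStages belongs to the unfitted Pipeline). The sketch below assumes LogisticRegression as a stand-in for the XGBoost estimator and uses invented toy data.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cv-best-stage")
  .getOrCreate()
import spark.implicits._

// Toy data, invented for illustration only.
val data = (0 until 40).map { i =>
  val label = (i % 2).toDouble
  (label, label + 0.1 * (i % 5), 1.0 - label + 0.05 * (i % 3))
}.toDF("label", "x1", "x2")

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
// Assumption: LogisticRegression is a stand-in for XGBoostEstimator.
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(
    new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build())
  .setNumFolds(2)
val cvModel = cv.fit(data)

// First level of nesting: narrow Model[_] to PipelineModel.
val bestPipeline: PipelineModel = cvModel.bestModel match {
  case p: PipelineModel => p
  case other => sys.error("expected a PipelineModel, got " + other.getClass.getName)
}

// Second level: the fitted classifier is the last stage of that pipeline.
val bestLr = bestPipeline.stages.last.asInstanceOf[LogisticRegressionModel]

// The threshold is set on the stage instance held by the pipeline, so
// transforming with the whole pipeline uses the new cutoff and still
// creates the "features" column.
bestLr.setThreshold(0.3)
val scored = bestPipeline.transform(data)
```

Changing the parameter in place mutates the best model, so a copy (for example via bestLr.copy with an empty ParamMap) may be preferable when the original cutoff is still needed.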