ML Pipeline for Spark Scala


I have a DataFrame (df) with the following structure:

Data

label pa_age pa_gender_category
10000 32.0   male
25000 36.0   female
45000 68.0   female
15000 24.0   male
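
For anyone trying to reproduce this, a minimal sketch of building the sample DataFrame above (assuming an existing SparkSession named spark; the values simply mirror the table):

import spark.implicits._

// Hypothetical reconstruction of the sample data shown above
val df = Seq(
  (10000, 32.0, "male"),
  (25000, 36.0, "female"),
  (45000, 68.0, "female"),
  (15000, 24.0, "male")
).toDF("label", "pa_age", "pa_gender_category")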
Goal

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))
I want to build a random forest classifier for the "label" column, using the "pa_age" and "pa_gender_category" columns as features.

The process I followed:

// Transform the labels column into labels index

val labelIndexer = new StringIndexer().setInputCol("label")
.setOutputCol("indexedLabel").fit(df)

// Transform column gender_category into labels

val featureTransformer = new StringIndexer().setInputCol("pa_gender_category")
.setOutputCol("pa_gender_category_label").fit(df)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
Expected output of the above steps:

label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0   male               1.0          1.0
25000 36.0   female             2.0          2.0
45000 68.0   female             3.0          2.0
10000 24.0   male               1.0          1.0
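For reference, since labelIndexer and featureTransformer above are already fitted, the indexed columns can be inspected with a quick check like the one below (note that StringIndexer assigns indices starting at 0.0, ordered by label frequency, so the actual values may differ from the 1.0/2.0/3.0 shown above):

// Apply both fitted indexers and inspect the added index columns
featureTransformer.transform(labelIndexer.transform(df))
  .select("label", "pa_age", "pa_gender_category",
          "indexedLabel", "pa_gender_category_label")
  .show()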
Now I need to convert the data into the "label" and "features" format:

val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)
Pipeline

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))
Problem

error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
       val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)
  • Basically, it is this step of converting the data into labels and features that I am having trouble with.

  • Is my process/pipeline correct?

    • The problem is right here:

      val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
      .setOutputCol("features").fit(df)

      You cannot call fit(df) here, because VectorAssembler is a Transformer, not an
      Estimator, and has no fit method. Also drop the .fit(df) calls from your
      StringIndexer stages. Once the pipeline is assembled, call fit on the pipeline
      object instead:

      val model = pipeline.fit(df)

      The pipeline will then run through every stage you gave it in order.

      Note that an unfitted StringIndexer has no labels member, so
      labelIndexer.labels will no longer compile; use getOutputCol instead.
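
Putting that advice together, here is a minimal sketch of one way the whole pipeline could be assembled so that it compiles and runs. It is a sketch, not the answerer's exact fix: labelIndexer is deliberately kept fitted so that labelIndexer.labels is still available to IndexToString (mirroring the official Spark RandomForestClassifierExample, and differing from the answer's suggestion to drop every .fit(df)), and the classifier's features column is set to "features" so it matches the VectorAssembler output, since "indexedFeatures" never exists in this pipeline.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}

// Label indexer: kept fitted here so that .labels is available below
// (a fitted StringIndexerModel is itself a Transformer and a valid stage)
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Gender indexer: left unfitted; the Pipeline fits it during pipeline.fit(df)
val featureTransformer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")

// VectorAssembler is a Transformer, so it is configured but never fitted
val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category_label"))
  .setOutputCol("features")

// Features column must match the assembler's output column ("features")
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(10)

// Map predicted indices back to the original label values
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureTransformer, featureCreater, rf, labelConverter))

val model = pipeline.fit(df)
val predictions = model.transform(df)
predictions.select("predictedLabel", "label", "features").show()

The design choice here is that a fitted StringIndexerModel can sit in a pipeline alongside unfitted stages; everything else is only fitted when pipeline.fit(df) runs.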