ML pipeline for Spark Scala
I have a DataFrame (df) with the following structure:
Data
label pa_age pa_gender_category
10000 32.0 male
25000 36.0 female
45000 68.0 female
15000 24.0 male
Goal
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))
I want to build a random forest classifier for the "label" column, with the "pa_age" and "pa_gender_category" columns as features.
Process followed
// Transform the labels column into labels index
val labelIndexer = new StringIndexer().setInputCol("label")
.setOutputCol("indexedLabel").fit(df)
// Transform column gender_category into labels
val featureTransformer = new StringIndexer().setInputCol("pa_gender_category")
.setOutputCol("pa_gender_category_label").fit(df)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
Expected output of the steps above:
label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0 male 1.0 1.0
25000 36.0 female 2.0 2.0
45000 68.0 female 3.0 2.0
10000 24.0 male 1.0 1.0
Now I need to convert the data into "label" and "features" format:
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)
Pipeline
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureTransformer,
featureCreater, rf, labelConverter))
Problem
error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)
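- Basically, this step of converting the data into label and features format is where I am running into trouble.
- Is my process/pipeline correct?
The problem is here: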
val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category"))
.setOutputCol("features").fit(df)
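You cannot call fit(df) here, because VectorAssembler has no fit method: it is a Transformer, not an Estimator. You can also drop .fit(df) from the StringIndexer stages, since the pipeline fits each estimator stage itself. After the pipeline is initialized, call the fit method on the pipeline object: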
val model = pipeline.fit(df)
The pipeline will now run through each of the stages you gave it, in order.
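To illustrate why the compiler rejects fit on VectorAssembler, here is a minimal sketch of the two kinds of pipeline stage. The object name StageKinds is made up for illustration; the classes and setters are Spark ML's own. No SparkSession is needed just to construct the stages.

```scala
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel, VectorAssembler}

// Hypothetical holder object, used only for this illustration.
object StageKinds {
  // StringIndexer is an Estimator: it has to see the data (fit) to learn
  // the mapping from string values to indices, and fit returns a model.
  val indexer: Estimator[StringIndexerModel] = new StringIndexer()
    .setInputCol("pa_gender_category")
    .setOutputCol("pa_gender_category_label")

  // VectorAssembler is a Transformer: it learns nothing from the data,
  // so it only exposes transform and has no fit method.
  val assembler: Transformer = new VectorAssembler()
    .setInputCols(Array("pa_age", "pa_gender_category_label"))
    .setOutputCol("features")
}
```

When pipeline.fit(df) runs, it calls fit on the Estimator stages and uses the Transformer stages as they are, which is why both kinds can sit in the same setStages array.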
Once .fit(df) is removed, StringIndexer has no labels attribute (only the fitted StringIndexerModel does); use getOutputCol instead.
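Putting the pieces together, the following is a hedged sketch of one way the whole pipeline could look, not the answerer's exact code. It makes three assumptions of its own: the label indexer stays fitted so that its labels remain available to IndexToString (a choice that differs from the advice above); the assembler reads the indexed column pa_gender_category_label rather than the raw string column; and the classifier's features column is set to "features" to match the assembler's output, where the question's code used "indexedFeatures". The names PipelineSketch and buildPipeline are made up for illustration. Note that labels is deprecated in Spark 3.x in favour of labelsArray.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object and method names, used only for this illustration.
object PipelineSketch {
  def buildPipeline(df: DataFrame): Pipeline = {
    // Fitted up front (assumption) so that .labels can be passed to IndexToString.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)

    // Left unfitted: the pipeline fits this estimator itself.
    val featureTransformer = new StringIndexer()
      .setInputCol("pa_gender_category")
      .setOutputCol("pa_gender_category_label")

    // A Transformer, so no fit call; it assembles the numeric columns.
    val featureCreater = new VectorAssembler()
      .setInputCols(Array("pa_age", "pa_gender_category_label"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setNumTrees(10)

    // Maps predicted indices back to the original label values.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    new Pipeline().setStages(
      Array(labelIndexer, featureTransformer, featureCreater, rf, labelConverter))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // The four sample rows from the question.
    val df = Seq(
      (10000, 32.0, "male"),
      (25000, 36.0, "female"),
      (45000, 68.0, "female"),
      (15000, 24.0, "male")
    ).toDF("label", "pa_age", "pa_gender_category")

    val model = buildPipeline(df).fit(df)
    model.transform(df).select("label", "features", "predictedLabel").show()
    spark.stop()
  }
}
```

StringIndexer assigns zero-based indices ordered by label frequency by default, so the actual indexedLabel and pa_gender_category_label values start at 0.0 rather than at 1.0 as in the expected output shown in the question.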