How to save a PCA object in Spark Scala?
I am performing PCA on my data, and I read the following guide. The relevant code is:
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))
// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))
// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
This code performs PCA on the data. However, I cannot find any example code or documentation explaining how to save and load the fitted PCA object for future use. Can someone give me an example based on the code above?

The MLlib version of PCA does not appear to support saving the model to disk. You can save the pc matrix of the resulting PCAModel yourself. However, prefer Spark ML instead: it returns a Spark estimator that can be serialized and included in a Spark ML Pipeline.
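Since the MLlib PCAModel cannot be saved directly, one workaround is to serialize its pc matrix (the principal components, a local matrix on the driver) with plain Java serialization and reapply the projection by hand. A minimal sketch, not part of the original answer; the file path is hypothetical, and pca is the fitted MLlib model from the code above:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}

// Persist the principal-components matrix (pc is a local DenseMatrix, not an RDD)
val out = new ObjectOutputStream(new FileOutputStream("/tmp/pca_pc.ser")) // hypothetical path
out.writeObject(pca.pc)
out.close()

// Later: restore the matrix and apply the projection manually,
// which is what PCAModel.transform does internally
val in = new ObjectInputStream(new FileInputStream("/tmp/pca_pc.ser"))
val pc = in.readObject().asInstanceOf[DenseMatrix]
in.close()

val projected = pc.transpose.multiply(Vectors.dense(1, 0, 0, 0, 1).toDense)
```

Note that PCAModel's constructor is private to Spark, so you cannot rebuild the model object itself from the matrix; you can only reapply the projection.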
Example code based on @EmiCareOfCell44's answer, using PCA and PCAModel from org.apache.spark.ml.feature:
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)

// save the model
val savePath = "xxxx"
pca.save(savePath)

// load the saved model
val pca_loaded = PCAModel.load(savePath)
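The loaded model can then be applied to new data exactly like the original. A brief sketch, assuming df is the DataFrame with the "features" column defined above:

```scala
// The loaded PCAModel carries its input/output columns and k from training
val reloadedResult = pca_loaded.transform(df).select("pcaFeatures")
reloadedResult.show(false)
```

Because PCAModel is a Spark ML Transformer, it can also be placed in a Pipeline and saved together with the other stages.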