Scala: converting a Spark RDD to a Dataset
After some text mining I am trying to run k-means clustering, but I cannot figure out how to convert the result of ParseWikipedia.termDocumentMatrix into the Dataset that the kmeans.fit method requires:
scala> val (termDocMatrix, termIds, docIds, idfs) = ParseWikipedia.termDocumentMatrix(lemmas, stopWords, numTerms, sc)
scala> val kmeans = new KMeans().setK(5).setMaxIter(200).setSeed(1L)
scala> termDocMatrix.take(1)
res24: Array[org.apache.spark.mllib.linalg.Vector] = Array((1000,[32,166,200,223,577,645,685,873,926],[0.18132966949934762,0.3777537726516676,0.3178848913768969,0.43380819546465704,0.30604090845847254,0.46007361524957147,0.2076406414508386,0.2995665853335863,0.1742843713808876]))
scala> val modele = kmeans.fit(termDocMatrix)
<console>:66: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
required: org.apache.spark.sql.Dataset[_]
val modele = kmeans.fit(termDocMatrix)
(I also tried passing each of the other returned values to kmeans.fit.) The only one that gives a different error is termDocVectors:
val modele = kmeans.fit(termDocVectors)
18/01/05 01:14:52 ERROR Executor: Exception in task 0.0 in stage 560.0 (TID 1682)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: org.apache.spark.mllib.linalg.SparseVector is not a valid external type for schema of vector
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else newInstance(class org.apache.spark.ml.linalg.VectorUDT).serialize AS features#75
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
Does anyone have a clue? Thanks for your help.
Also, after testing the suggested clue: where should I apply toDS?
scala> termDocMatrix.toDS
<console>:69: error: value toDS is not a member of org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
termDocMatrix.toDS
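For what it's worth, toDS is an implicit extension brought into scope by import spark.implicits._, and it needs an implicit Encoder for the RDD's element type; the implicits provide none for a bare mllib Vector, which is why the call above does not compile. A minimal sketch (made-up data, assuming a local SparkSession) of a case where toDS does apply, by wrapping each vector in a Tuple1 so the product encoder kicks in:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("toDS-sketch").getOrCreate()
import spark.implicits._ // brings toDS into scope on RDDs

// Tuple1 of a vector has a product encoder, so toDS compiles here
val rdd = spark.sparkContext.parallelize(Seq(
  Tuple1(Vectors.dense(1.0, 0.0)),
  Tuple1(Vectors.dense(0.0, 1.0))
))
val ds = rdd.toDS.withColumnRenamed("_1", "features")
```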
The initial problem seems to be solved. Now I face a new one: I compute the SVD from an mllib.RowMatrix, and kmeans seems to expect ml vectors. I just need to find out how to compute the SVD in the ml package…

Spark's Dataset API provides no encoder for org.apache.spark.mllib.linalg.Vector. That said, you can try converting the RDD of MLlib vectors into a Dataset by first mapping the vectors into a Tuple1, as in the following example, to see whether your ML model accepts it:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val termDocMatrix = sc.parallelize(Array(
  Vectors.sparse(
    1000, Array(32, 166, 200, 223, 577, 645, 685, 873, 926), Array(
      0.18132966949934762, 0.3777537726516676, 0.3178848913768969,
      0.43380819546465704, 0.30604090845847254, 0.46007361524957147,
      0.2076406414508386, 0.2995665853335863, 0.1742843713808876
    )),
  Vectors.sparse(
    1000, Array(74, 154, 343, 405, 446, 538, 566, 612, 732), Array(
      0.12128098267647237, 0.2499114848264329, 0.1626128536458679,
      0.12167467201712565, 0.2790928578869498, 0.24904429178306794,
      0.10039172907499895, 0.22803472531961744, 0.36408630055671115
    ))
))
// termDocMatrix: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ...

val ds = spark.createDataset(termDocMatrix.map(Tuple1.apply)).
  withColumnRenamed("_1", "features")
// ds: org.apache.spark.sql.DataFrame = [features: vector]
ds.show
// +--------------------+
// | features|
// +--------------------+
// |(1000,[32,166,200...|
// |(1000,[74,154,343...|
// +--------------------+
Comments:
- What about toDS? Thanks for the suggestion… I've updated the question. In case you don't know it, here is something related. @Jice, your ML model seems to expect a column named "features"; I've updated my answer.
- I've re-updated. Your suggestion seems to work, but I still face a conversion problem. In any case I found a workaround: I saved termDocMatrix to a CSV file and reloaded it with the right structure. Not entirely satisfying, but at least I can move on if I don't find better. Thanks.
- Thanks @Jice for the thorough update. Interesting that saving/reloading as CSV helped solve it.
val ds = spark.createDataset(termDocMatrix.map(Tuple1.apply)).withColumnRenamed("_1", "features")
ds: org.apache.spark.sql.DataFrame = [features: vector]
scala> val modele = kmeans.fit(ds)
java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
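The remaining mismatch is between the old mllib vector type and the new ml one that the ml KMeans expects. A minimal sketch (made-up data, assuming a local SparkSession; termDocMatrix is stood in for by a small RDD) of bridging the two with the vectors' asML method, available since Spark 2.0:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("asML-sketch").getOrCreate()
import spark.implicits._

// stand-in for termDocMatrix: an RDD[org.apache.spark.mllib.linalg.Vector]
val mllibRdd = spark.sparkContext.parallelize(Seq(
  MLlibVectors.sparse(4, Array(0, 2), Array(1.0, 3.0)),
  MLlibVectors.dense(0.0, 1.0, 0.0, 2.0)
))

// asML converts each mllib vector to an ml vector, so the resulting
// "features" column carries org.apache.spark.ml.linalg.VectorUDT
val ds = mllibRdd.map(v => Tuple1(v.asML)).toDF("features")

val modele = new KMeans().setK(2).setSeed(1L).fit(ds)
```

Alternatively, if you already have a DataFrame holding an mllib vector column, org.apache.spark.mllib.util.MLUtils.convertVectorColumnsToML converts it to the ml vector type without remapping the RDD by hand.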