Apache Spark: generating vectors from text data for KMeans using Spark


I am new to Spark and machine learning. I am trying to cluster some data with KMeans, for example:

1::Hi How are you
2::I am fine, how about you
In the data, the delimiter is ::, and the actual text to cluster is the second column, which contains the text data. After reading the official Spark pages and many articles, I wrote the following code, but I cannot generate the vectors needed as input for the KMeans.train step:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local", "test") 

val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val rawData = sc.textFile("data/mllib/KM.txt").map(line => line.split("::")(1))

val sentenceData = rawData.toDF("sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)

val clusters = KMeans.train(featurizedData, 2, 10)
I get the following error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val clusters = KMeans.train(featurizedData, 2, 10)
Please suggest how to prepare the input data for KMeans.


Thanks in advance.

I finally got it working by replacing the following code:

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)

val clusters = KMeans.train(featurizedData, 2, 10)

with:

// The ML Pipeline API works on DataFrames directly, so no RDD[Vector] is needed.
// Note: this spark.ml KMeans shadows the spark.mllib KMeans imported earlier.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")

val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))

// Fit the whole pipeline on the raw sentence DataFrame
val model = pipeline.fit(sentenceData)
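For completeness, the original RDD-based mllib route can also be made to work. The type-mismatch error says it all: KMeans.train expects an RDD[org.apache.spark.mllib.linalg.Vector], not a DataFrame, so the feature column produced by HashingTF has to be extracted and converted first. A minimal sketch, assuming Spark 2.x (where the HashingTF output column holds spark.ml vectors and Vectors.fromML converts them); the variable names reuse those from the question:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// featurizedData has a "rawFeatures" column of spark.ml vectors;
// pull it out as an RDD and convert each vector to the mllib type
val vectorRdd = featurizedData
  .select("rawFeatures")
  .rdd
  .map(row => Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector]("rawFeatures")))

vectorRdd.cache()  // KMeans makes several passes over the data

val clusters = KMeans.train(vectorRdd, 2, 10)
```

On Spark 1.x the column already holds mllib vectors, so the fromML conversion would be dropped. In new code, though, the Pipeline version above is the idiomatic choice, since the RDD-based mllib API is in maintenance mode.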