Apache Spark: generating vectors from text data for KMeans
I am new to Spark and machine learning. I am trying to cluster some data with KMeans, such as:
1::Hi How are you
2::I am fine, how about you
In the data the delimiter is ::, and the actual text to cluster is the second column.
After reading the official Spark pages and many articles, I wrote the following code, but I cannot produce the vectors needed as input to the KMeans.train step:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
val sc = new SparkContext("local", "test")
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val rawData = sc.textFile("data/mllib/KM.txt").map(line => line.split("::")(1))
val sentenceData = rawData.toDF("sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
I get the following error:
<console>:27: error: type mismatch;
found : org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val clusters = KMeans.train(featurizedData, 2, 10)
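The error message spells out the mismatch: the ml-package HashingTF produces a DataFrame, while mllib's KMeans.train expects an RDD[Vector]. One way to bridge the two is to pull the feature column out of the DataFrame as an RDD — a sketch, assuming a Spark 1.x setup where the HashingTF output column holds mllib vectors:

```scala
import org.apache.spark.mllib.linalg.Vector

// Extract the "rawFeatures" column as an RDD[Vector],
// the input type that mllib's KMeans.train expects.
val featureVectors = featurizedData
  .select("rawFeatures")
  .rdd
  .map(_.getAs[Vector](0))

val clusters = KMeans.train(featureVectors, 2, 10)
```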
Please suggest how to prepare the input data for KMeans. Thanks in advance.

I finally got it working after replacing
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)
with
// Note: this is the ml-package KMeans (DataFrame-based), not the mllib one imported earlier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
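With the pipeline assembled, fitting it and inspecting the cluster assignments is a short step — a sketch, where the "prediction" column name follows the setPredictionCol setting above:

```scala
// Fit the whole tokenize -> hash -> cluster pipeline on the raw sentences
val model = pipeline.fit(sentenceData)

// transform() adds a "prediction" column holding each row's cluster index
model.transform(sentenceData).select("sentence", "prediction").show()
```

Because every stage lives in one Pipeline, the same tokenization and hashing are applied consistently at both fit and transform time, which is the main advantage over wiring the ml feature stages into the RDD-based mllib KMeans by hand.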