K-means in Spark (Scala): how to map cluster numbers back to customer IDs when the model was built from standardized data

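The code below is what I use to build the model. The problem I am facing is mapping the cluster numbers back to customer IDs: the model was trained on standardized data, but the data that carries the customer IDs is not standardized. I can't figure out how to map back.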

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.util.MLUtils

// Import the data for clustering
val data = sc.textFile("hdfs://path/data_for_clus1")
// Split on the \u0001 delimiter and drop the first column (presumably the customer ID)
val vectors = data.map(s => s.split('\u0001')).map(s => s.slice(1, s.size))
val parsedData = vectors.map(s => Vectors.dense(s.map(_.toDouble)))

val dataAsArray = parsedData.map(_.toArray)
// Use StandardScaler to standardize the data (zero mean, unit standard deviation)
val features = dataAsArray.map(a => Vectors.dense(a))
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaledFeatures = scaler.transform(features)

val WSSEBuffer = ArrayBuffer[Double]()

// K-means on the standardized features
val numClusters = 20
val numIterations = 500
val clusters = KMeans.train(scaledFeatures, numClusters, numIterations)
// Within-set sum of squared errors
val WSSSE = clusters.computeCost(scaledFeatures)

Using the "clusters" model, I want to assign a cluster number to each customer ID in the "data" table.

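Parse the data as

val newdata = Array[(customerID, featureArray)]

Then predict a cluster for each entry, keeping the customerID next to the result.

Not sure whether this is an efficient approach.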
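
One way to keep the IDs attached could look like the sketch below. It rests on a few assumptions that are not stated in the question: column 0 of each line is the customer ID (the question's code drops that column with slice(1, s.size)), the IDs are kept as strings, and the pairs live in an RDD rather than a local Array. The helper name clusterByCustomer is hypothetical.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns (customerID, clusterNumber) pairs.
// Assumption: each line is '\u0001'-delimited, column 0 is the customer ID
// and every remaining column is a numeric feature.
def clusterByCustomer(lines: RDD[String],
                      numClusters: Int,
                      numIterations: Int): RDD[(String, Int)] = {
  // Keep the ID paired with its raw feature vector instead of discarding it.
  val idAndFeatures = lines.map { line =>
    val cols = line.split('\u0001')
    (cols.head, Vectors.dense(cols.tail.map(_.toDouble)))
  }.cache()

  // Fit the scaler on the feature vectors only.
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(idAndFeatures.map(_._2))

  // Standardize each vector individually so the ID travels with it.
  val idAndScaled = idAndFeatures.map { case (id, v) =>
    (id, scaler.transform(v))
  }.cache()

  // Train on the standardized vectors, as in the question.
  val model = KMeans.train(idAndScaled.map(_._2), numClusters, numIterations)

  // Predict on the same standardized vectors the model was trained on.
  idAndScaled.map { case (id, v) => (id, model.predict(v)) }
}
```

With the question's settings this would be called as clusterByCustomer(sc.textFile("hdfs://path/data_for_clus1"), 20, 500). Because the ID and the vector are transformed inside the same tuple, no join between standardized and unstandardized data is needed afterwards.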
