Calculating the Dunn Index in Spark Scala

I am trying to compute the Dunn index to measure the performance of my KMeans clustering on the following dataset:

         V1            V2            V3          V4          V5        
   -0.80688767  2.580938e-01 -2.310133e-01 -0.69172608  0.76195996  
   -0.80871432  5.357830e-01 -2.320617e-01 -1.09496541  0.71935607  
   -0.79147152 -6.051847e-01 -9.574660e-02 -1.02494869  0.89793288  
   -0.77096829 -1.859497e+00 -4.956332e-01 -0.77016532  1.20462390  
   -0.67800192 -1.595468e+00 -7.405667e-01 -0.89351545  0.92360485  
   -0.62255535  1.167977e+00 -1.656397e-01 -0.59319708  1.20205692  
   -0.81017300 -1.234912e+00 -5.714762e-01 -0.86877635  0.32971553  
   -0.72079901  5.085883e-01 -5.726607e-01 -0.91749111  0.46749543  
   -0.87377368 -5.650047e-01 -1.437415e-01 -0.65893811  0.61737109  
Below is my code for computing the Dunn index:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val sc = spark.sparkContext

// Load the CSV with 16 partitions.
val data = sc.textFile("/FileStore/tables/sample.csv", 16)

// Parse each line into an array of Doubles, keyed by its first value.
val dataRDD = data
      .map(s => s.split(",")
        .map(_.toDouble))
      .keyBy(_.apply(0))
      .cache()

val parsedData = dataRDD.map(s => Vectors.dense(s._2)).cache()

// k = 2 clusters, up to 100 iterations.
val clusters = KMeans.train(parsedData, 2, 100)

// Global centers
val centroides = sc.parallelize(clusters.clusterCenters)

// All ordered pairs of distinct centers.
val centroidesCartesian = centroides.cartesian(centroides).filter(x => x._1 != x._2).cache()

// DUNN
// Minimum squared distance between two distinct cluster centers.
val minA = centroidesCartesian.map(x => Vectors.sqdist(x._1, x._2)).min()

// Maximum squared distance from a point to its assigned cluster center.
val maxB = parsedData.map( r => Vectors.sqdist(r, clusters.clusterCenters(clusters.predict(r)))).max

// Get Dunn index
val dunn = minA / maxB
I get an error when computing maxB. The error is "org.apache.spark.SparkException: Task not serializable".

This is the line of code that produces the error:

val maxB = parsedData.map(r => Vectors.sqdist(r, clusters.clusterCenters(clusters.predict(r)))).max

Do you have any idea how to solve this problem?


Also, I would be glad to know whether there is a better way to compute the Dunn index in Spark Scala.

You are trying to execute a transformation inside another transformation, namely the .predict step inside the .map step. In addition, you are passing a single Vector to .predict, instead of the RDD[Vector] expected by the method def predict(points: RDD[Vector]): RDD[Int].
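
A minimal sketch of that suggestion, assuming the clusters model and parsedData from the question: predict the whole RDD in one call, zip the assignments back onto the points, and keep only the plain centers array inside the closure so the model itself never has to be serialized.

// Predict cluster assignments for the whole RDD at once (RDD-based predict).
val assignments = clusters.predict(parsedData) // RDD[Int]

// An Array[Vector] is serializable; referencing it avoids shipping the model.
val centers = clusters.clusterCenters

val maxB = parsedData.zip(assignments)
  .map { case (point, cluster) => Vectors.sqdist(point, centers(cluster)) }
  .max

The zip should be safe here because predict maps each input point to exactly one label, preserving the partitioning of parsedData.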
Thanks @chlebek. Is there a cleaner way to separate these transformations and still get the desired result? I can get it to run by using collect to build a single array, something like this, but I don't think it scales to larger datasets: val maxB = parsedData.collect.map(r => Vectors.sqdist(r, clusters.clusterCenters(clusters.predict(r)))).max
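
If collect becomes a bottleneck, one scalable alternative (a sketch under the same assumptions as above) is to broadcast the centers and compute each point's distance to its nearest center directly, with no model object in the closure and no driver-side array:

// Broadcast the centers so every executor gets one read-only local copy.
val bcCenters = sc.broadcast(clusters.clusterCenters)

// Squared distance from each point to its nearest center, computed fully distributed.
val maxB = parsedData
  .map(p => bcCenters.value.map(c => Vectors.sqdist(p, c)).min)
  .max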