
Apache Spark: how to convert an org.apache.spark.rdd.RDD[Array[Double]] to the Array[Double] required by Spark MLlib


I am trying to implement KMeans using Apache Spark:

val data = sc.textFile(irisDatasetString)
val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

val clusters = KMeans.train(parsedData,3,numIterations = 20)
on which I get the following error:

error: overloaded method value train with alternatives:
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int,initializationMode: String)org.apache.spark.mllib.clustering.KMeansModel
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]], Int, numIterations: Int)
       val clusters = KMeans.train(parsedData,3,numIterations = 20)
so I tried converting my data to a Vector:

val vectorData: Vector = Vectors.dense(parsedData)

on which I get the following errors:

error: type Vector takes type parameters
   val vectorData: Vector = Vectors.dense(parsedData)
                   ^
error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]])
       val vectorData: Vector = Vectors.dense(parsedData)
So I infer that org.apache.spark.rdd.RDD[Array[Double]] is not the same as Array[Double].

How can I proceed with my data as an org.apache.spark.rdd.RDD[Array[Double]]? Or how can I convert an org.apache.spark.rdd.RDD[Array[Double]] to an Array[Double]?

KMeans.train expects an RDD[Vector], not an RDD[Array[Double]]. It seems to me that all you need to do is change

val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

to

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

so that each row becomes an org.apache.spark.mllib.linalg.Vector and parsedData becomes an RDD[Vector], which KMeans.train can consume directly.
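
For reference, a minimal end-to-end sketch of this fix. The file name, app name, and cluster parameters are placeholders, and it assumes a purely numeric CSV (the real Iris file ends each row with a species label that would have to be dropped first):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// In the spark-shell, sc already exists and this line should be skipped
val sc = new SparkContext(new SparkConf().setAppName("kmeans-iris").setMaster("local[*]"))

// Parse each CSV line into a dense MLlib Vector, giving an RDD[Vector]
val data = sc.textFile("iris.csv") // placeholder path, numeric columns only
val parsedData = data.map(line => Vectors.dense(line.split(',').map(_.toDouble))).cache()

// KMeans.train accepts the RDD[Vector] directly; no collect() is needed
val clusters = KMeans.train(parsedData, 3, 20) // k = 3, maxIterations = 20

clusters.clusterCenters.foreach(println) // one center Vector per cluster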


No, that doesn't work. I now get the following error: error: missing parameter type for expanded function ((x$1) => x$1.split(',').map(((x$2) => x$2.toDouble))). I also tried val parsedData = data.map(Vectors.dense(_.split(',').map(_.toDouble))).cache(). With that I get parsedData of type org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector], and then I tried converting it to a Vector using val dataArray = parsedData.collect and val dataVector = Vectors.dense(dataArray), since my dataArray is of type Array[org.apache.spark.mllib.linalg.Vector] and Vectors.dense expects an Array[Double].

Why do you expect RDD[Vector] to be a Vector? KMeans.train takes an RDD[Vector].

You're right :) For some reason I thought I had to collect the data and then pass it to KMeans. Your solution works :) Thank you.

Hey climbage, how would you write the same thing in PySpark? I am trying to get multivariate statistics for the data in a CSV file. The function requires an RDD[Vector], and I have no clue how to get one.
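
The same conversion answers that last comment, at least in Scala (a PySpark version would mirror it line for line). A minimal sketch, assuming the function in question is org.apache.spark.mllib.stat.Statistics.colStats, that sc is already available, and that the CSV is purely numeric:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Build an RDD[Vector] from the CSV exactly as in the answer, one Vector per row
val observations = sc.textFile("data.csv") // placeholder path
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// colStats returns a MultivariateStatisticalSummary over the RDD[Vector]
val summary = Statistics.colStats(observations)
println(summary.mean)        // column-wise mean
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // column-wise count of non-zero entries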