Scala: How to transform the synthetic control dataset into an RDD[Vector] for the K-Means algorithm

I am trying to transform the "Synthetic Control Chart Time Series" dataset available from the UCI Machine Learning Repository.

The dataset looks like this:

28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337  34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27   30.7326 29.5054 33.0292 25.04   28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747  31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717
24.8923 25.741  27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516 31.6595 27.5415 31.1887 27.4867 31.391  27.811  24.488  27.5918 35.6273 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331 33.6564 26.4244 28.4661 34.2484 32.1005 26.691
31.3987 30.6316 26.3983 24.2905 27.8613 28.5491 24.9717 32.4358 25.2239 27.3068 31.8387 27.2587 28.2572 26.5819 24.0455 35.0625 31.5717 32.5614 31.0308 34.1202 26.9337 31.4781 35.0173 32.3851 24.3323 30.2001 31.2452 26.6814 31.5137 28.8778 27.3086 24.246  26.9631 25.2919 31.6114 24.7131 27.4809 24.2075 26.8059 35.1253 32.6293 31.0561 26.3583 28.0861 31.4391 27.3057 29.6082 35.9725 34.1444 27.1717 33.6318 26.5966 25.5387 32.5434 25.5772 29.9897 31.351  33.9002 29.5446 29.343
The data is stored in an ASCII file with 600 rows and 60 columns, one chart per row. The numbers within each row are separated by spaces, and the rows are separated by newlines. I have to transform each row of 60 numbers and store it in an RDD[Vector], so that every vector holds the 60 numbers of one row. The RDD[Vector] should look like this:

[28.7812 34.4632 31.3381 31.2834 28.9207 33.7596 25.3969 27.7849 35.2479 27.1159 32.8717 29.2171 36.0253 32.337  34.5249 32.8717 34.1173 26.5235 27.6623 26.3693 25.7744 29.27   30.7326 29.5054 33.0292 25.04   28.9167 24.3437 26.1203 34.9424 25.0293 26.6311 35.6541 28.4353 29.1495 28.1584 26.1927 33.3182 30.9772 27.0443 35.5344 26.2353 28.9964 32.0036 31.0558 34.2553 28.0721 28.9402 35.4973 29.747  31.4333 24.5556 33.7431 25.0466 34.9318 34.9879 32.4721 33.3759 25.4652 25.8717]
[24.8923 25.741  27.5532 32.8217 27.8789 31.5926 31.4861 35.5469 27.9516 31.6595 27.5415 31.1887 27.4867 31.391  27.811  24.488  27.5918 35.6273 35.4102 31.4167 30.7447 24.1311 35.1422 30.4719 31.9874 33.6615 25.5511 30.4686 33.6472 25.0701 34.0765 32.5981 28.3038 26.1471 26.9414 31.5203 33.1089 24.1491 28.5157 25.7906 35.9519 26.5301 24.8578 25.9562 32.8357 28.5322 26.3458 30.6213 28.9861 29.4047 32.5577 31.0205 26.6418 28.4331 33.6564 26.4244 28.4661 34.2484 32.1005 26.691]
I tried to transform the data, but I get an exception. This is the code:

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/david/Desktop/synthetic.txt")
val parsedData = data.map(s => Vectors.dense(s.split("\n").map(_.toDouble))).cache()
When I run the K-Means algorithm, I get an exception:

import org.apache.spark.mllib.clustering.KMeans

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
This is the exception:

16/06/07 19:56:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.NumberFormatException: empty String
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1020)
    at java.lang.Double.parseDouble(Double.java:540)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at org.test.spark.RunKMeans$$anonfun$1$$anonfun$apply$1.apply(RunKMeans.scala:22)
    at org.test.spark.RunKMeans$$anonfun$1$$anonfun$apply$1.apply(RunKMeans.scala:22)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.test.spark.RunKMeans$$anonfun$1.apply(RunKMeans.scala:22)
    at org.test.spark.RunKMeans$$anonfun$1.apply(RunKMeans.scala:22)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:283)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
How can I solve this? Thank you very much.

You are splitting on \n (newline), when you should be splitting on spaces and converting each element to a Double; only then can you pack them into a dense MLlib Vector.

Therefore, change s.split("\n") to s.split(" "), or use \s instead.

========== EDIT ==========

Since you need to split on multiple spaces, you should use:

 split("\\s+")

This splits on both single and multiple spaces.
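
For example, here is a minimal sketch of the corrected parsing (the .trim call is my own addition, not from the answer above; it guards against leading or trailing whitespace, which would otherwise produce empty tokens and a NumberFormatException):

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/david/Desktop/synthetic.txt")
// Trim first so leading/trailing blanks cannot yield empty tokens,
// then split on one or more whitespace characters
val parsedData = data.map(s => Vectors.dense(s.trim.split("\\s+").map(_.toDouble))).cache()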

I edited the code, but now I get another exception: java.lang.NumberFormatException: empty String.
Are there trailing spaces in the data? Also, does every row contain the same number of doubles?
@davidebegarcia Then show us your edited code; we cannot guess what changes you made. As GameOfThrows said, it is probably whitespace: you have multiple spaces between some of the numbers, so check how your split handles them.
val data = sc.textFile("/home/david/Desktop/synthetic.txt")
val parsedData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble))).cache()
The dataset consists of 600 blocks, each with 60 numbers. Each block is separated from the others by a newline, but the numbers inside a block are separated by one or more spaces. Sometimes a number is followed by more than one space, so that all the numbers line up at the same width. It is working now. Here is the complete working code:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/home/david/Desktop/synthetic.txt")
val parsedData = data.map(s => Vectors.dense(s.split("\\s+").map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Thank you very much @GameOfThrows.
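
As a possible follow-up (not part of the original thread), the resulting model can be sanity-checked with KMeansModel.computeCost, which returns the within-set sum of squared errors (WSSSE) for the clustering:

// Evaluate the clustering by computing the Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")

A lower WSSSE on the same data generally indicates tighter clusters, and it is commonly used to compare runs with different values of numClusters.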