Scala - Avoiding a shuffle with reduceByKey in Spark
I'm taking the Coursera course on Scala and Spark, and I'm trying to optimize this snippet:
val indexedMeansG = vectors.
map(v => findClosest(v, means) -> v).
groupByKey.mapValues(averageVectors)
vectors is an RDD[(Int, Int)]. To inspect the list of dependencies and the RDD lineage, I used:
println(s"""GroupBy:
| Deps: ${indexedMeansG.dependencies.size}
| Deps: ${indexedMeansG.dependencies}
| Lineage: ${indexedMeansG.toDebugString}""".stripMargin)
This prints:
/* GroupBy:
* Deps: 1
* Deps: List(org.apache.spark.OneToOneDependency@44d1924)
* Lineage: (6) MapPartitionsRDD[18] at mapValues at StackOverflow.scala:207 []
* ShuffledRDD[17] at groupByKey at StackOverflow.scala:207 []
* +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:206 []
* MapPartitionsRDD[13] at map at StackOverflow.scala:139 []
* CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
* MapPartitionsRDD[12] at values at StackOverflow.scala:116 []
* MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []
* MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []
* MapPartitionsRDD[9] at join at StackOverflow.scala:91 []
* MapPartitionsRDD[8] at join at StackOverflow.scala:91 []
* CoGroupedRDD[7] at join at StackOverflow.scala:91 []
* +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []
* | MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []
* | MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* | src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* | src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []
* +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []
* MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []
* MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
From this List(org.apache.spark.OneToOneDependency@44d1924) I infer that no shuffle is being performed; am I right? However, a ShuffledRDD[17] is printed below, which means there actually is a shuffle.
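To dig a bit further, the whole dependency graph can be walked recursively instead of looking only at the top-level dependencies; a minimal sketch (the printDeps helper is my own, not part of the assignment):

import org.apache.spark.rdd.RDD

// Print the dependency type that links each RDD in the chain to its parent.
def printDeps(rdd: RDD[_], indent: String = ""): Unit =
  rdd.dependencies.foreach { dep =>
    println(s"$indent$rdd -> ${dep.getClass.getSimpleName}")
    printDeps(dep.rdd, indent + "  ")
  }

printDeps(indexedMeansG)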
I tried replacing the groupByKey call with a reduceByKey, like this:
val indexedMeansR = vectors.
map(v => findClosest(v, means) -> v).
reduceByKey((a, b) => (a._1 + b._1) / 2 -> (a._2 + b._2) / 2)
Its dependencies and lineage are:
/* ReduceBy:
* Deps: 1
* Deps: List(org.apache.spark.ShuffleDependency@4d5e813f)
* Lineage: (6) ShuffledRDD[17] at reduceByKey at StackOverflow.scala:211 []
* +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:210 []
* MapPartitionsRDD[13] at map at StackOverflow.scala:139 []
* CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
* MapPartitionsRDD[12] at values at StackOverflow.scala:116 []
* MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []
* MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []
* MapPartitionsRDD[9] at join at StackOverflow.scala:91 []
* MapPartitionsRDD[8] at join at StackOverflow.scala:91 []
* CoGroupedRDD[7] at join at StackOverflow.scala:91 []
* +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []
* | MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []
* | MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* | src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* | src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []
* +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []
* MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []
* MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
This time the dependency is a ShuffleDependency, and I don't understand why. Since the RDD holds pairs and the keys are Ints, there is an ordering on them, so I also tried changing the partitioner and using a RangePartitioner (sketched below), but that didn't improve things either.
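For reference, a sketch of what that partitioner attempt might have looked like (the partition count of 6 is assumed from the lineage above; reduceByKey has an overload that takes an explicit Partitioner):

import org.apache.spark.RangePartitioner

val pairs = vectors.map(v => findClosest(v, means) -> v)
// Same reduction as before, but with an explicit RangePartitioner.
val indexedMeansP = pairs.reduceByKey(
  new RangePartitioner(6, pairs),
  (a, b) => (a._1 + b._1) / 2 -> (a._2 + b._2) / 2)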
A reduceByKey operation still involves a shuffle, because it still has to ensure that all items with the same key end up in the same partition. However, it will be a much smaller shuffle than a groupByKey: reduceByKey performs the reduction within each partition before shuffling, which reduces the amount of data that has to be shuffled.
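To make the map-side reduction concrete, here is a sketch (assuming vectors: RDD[(Int, Int)] and findClosest as in the assignment) that carries a (sum, count) pair per key; the pairs are combined within each partition before the shuffle, and dividing once at the end gives the exact mean, which the pairwise (a + b) / 2 averaging above does not:

val indexedMeans = vectors.
  map(v => findClosest(v, means) -> (v, 1)).
  reduceByKey { case (((x1, y1), c1), ((x2, y2), c2)) =>
    // Partial sums and counts combine associatively, so Spark can
    // pre-aggregate them on the map side before shuffling.
    ((x1 + x2, y1 + y2), c1 + c2)
  }.
  mapValues { case ((xSum, ySum), count) => (xSum / count, ySum / count) }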
But according to the dependencies output, groupByKey has a OneToOneDependency, which does not involve a shuffle, while reduceByKey has a ShuffleDependency, which does. Why?
The OneToOneDependency corresponds to the mapValues call, not to the groupByKey call. If you remove the mapValues, you should see a ShuffleDependency there too. Also, note the ShuffledRDD in the groupByKey lineage.
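A minimal check of that claim, using the same vectors and findClosest as above:

// Without the trailing mapValues, the top-level dependency of the result
// is the shuffle introduced by groupByKey itself.
val grouped = vectors.map(v => findClosest(v, means) -> v).groupByKey
println(grouped.dependencies)
// expected: List(org.apache.spark.ShuffleDependency@...)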