Scala: Avoiding the ReduceByKey shuffle in Spark

I am taking the Coursera course on Scala and Spark, and I am trying to optimize this snippet:

val indexedMeansG = vectors.
  map(v => findClosest(v, means) -> v).
  groupByKey.mapValues(averageVectors)
vectors is an RDD[(Int, Int)]. To inspect the dependency list and the lineage of the RDD, I used:

println(s"""GroupBy:
            | Deps: ${indexedMeansG.dependencies.size}
            | Deps: ${indexedMeansG.dependencies}
            | Lineage: ${indexedMeansG.toDebugString}""".stripMargin)
which shows:

/* GroupBy:                                                                                                                                                                                  
   * Deps: 1                                                                                                                                                                                      
   * Deps: List(org.apache.spark.OneToOneDependency@44d1924)                                                                                                                                      
   * Lineage: (6) MapPartitionsRDD[18] at mapValues at StackOverflow.scala:207 []                                                                                                                 
   *  ShuffledRDD[17] at groupByKey at StackOverflow.scala:207 []                                                                                                                                 
   * +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:206 []                                                                                                                              
   *  MapPartitionsRDD[13] at map at StackOverflow.scala:139 []                                                                                                                                   
   *      CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B                                                                                                
   *  MapPartitionsRDD[12] at values at StackOverflow.scala:116 []                                                                                                                                
   *  MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []                                                                                                                             
   *  MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []                                                                                                                             
   *  MapPartitionsRDD[9] at join at StackOverflow.scala:91 []                                                                                                                                    
   *  MapPartitionsRDD[8] at join at StackOverflow.scala:91 []                                                                                                                                    
   *  CoGroupedRDD[7] at join at StackOverflow.scala:91 []                                                                                                                                        
   *    +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []                                                                                                                             
   *  |  MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []                                                                                                                               
   *  |  MapPartitionsRDD[2] at map at StackOverflow.scala:69 []                                                                                                                                  
   *  |  src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []                                                                          
   *  |  src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []                                                                                 
   *    +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []                                                                                                                             
   *  MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []                                                                                                                                  
   *  MapPartitionsRDD[2] at map at StackOverflow.scala:69 []                                                                                                                                     
   *  src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []                                                                             
   *  src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
From this List(org.apache.spark.OneToOneDependency@44d1924) I deduced that no shuffle is taking place; is that right? However, ShuffledRDD[17] is printed below, which means there actually is a shuffle.

I tried replacing the groupByKey call with reduceByKey, like this:

val indexedMeansR = vectors.
  map(v => findClosest(v, means) -> v).
  reduceByKey((a, b) => (a._1 + b._1) / 2 -> (a._2 + b._2) / 2)
whose dependencies and lineage are:

/* ReduceBy:                                                                                                                                                                                 
   * Deps: 1                                                                                                                                                                                      
   * Deps: List(org.apache.spark.ShuffleDependency@4d5e813f)                                                                                                                                      
   * Lineage: (6) ShuffledRDD[17] at reduceByKey at StackOverflow.scala:211 []                                                                                                                    
   * +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:210 []                                                                                                                              
   *  MapPartitionsRDD[13] at map at StackOverflow.scala:139 []                                                                                                                                   
   *      CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B                                                                                                
   *  MapPartitionsRDD[12] at values at StackOverflow.scala:116 []                                                                                                                                
   *  MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []                                                                                                                             
   *  MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []                                                                                                                             
   *  MapPartitionsRDD[9] at join at StackOverflow.scala:91 []                                                                                                                                    
   *  MapPartitionsRDD[8] at join at StackOverflow.scala:91 []                                                                                                                                    
   *  CoGroupedRDD[7] at join at StackOverflow.scala:91 []                                                                                                                                        
   *    +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []                                                                                                                             
   *  |  MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []                                                                                                                               
   *  |  MapPartitionsRDD[2] at map at StackOverflow.scala:69 []                                                                                                                                  
   *  |  src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []                                                                          
   *  |  src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []                                                                                 
   *    +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []                                                                                                                             
   *  MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []                                                                                                                                  
   *  MapPartitionsRDD[2] at map at StackOverflow.scala:69 []                                                                                                                                     
   *  src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []                                                                             
   *  src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
This time the dependency is a ShuffleDependency, and I don't understand why.


Since the RDD contains pairs and the keys are Ints, there is an ordering, so I also tried changing the partitioner to a RangePartitioner, but that did not improve things either.

A reduceByKey operation still involves a shuffle, because it is still necessary to ensure that all items with the same key end up in the same partition.
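That said, reduceByKey can skip the shuffle entirely when the pair RDD has already been partitioned by key with the same partitioner, because every key's items are already co-located. A hypothetical sketch built on the question's vectors, findClosest and means (the HashPartitioner(6) is an assumption matching the six partitions shown in the lineage; partitionBy itself shuffles once):

```scala
import org.apache.spark.HashPartitioner

// Pre-partition by key once; this step is itself a shuffle.
val partitioner = new HashPartitioner(6)
val keyed = vectors.
  map(v => findClosest(v, means) -> v).
  partitionBy(partitioner).
  cache()

// keyed.partitioner == Some(partitioner), so this reduceByKey adds
// only a OneToOneDependency instead of a ShuffleDependency.
val summedByKey = keyed.reduceByKey((a, b) => (a._1 + b._1) -> (a._2 + b._2))
```

This only pays off if keyed is cached and reused across several reductions, since partitionBy pays the shuffle cost up front.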


However, this will be a much smaller shuffle than the one triggered by groupByKey: reduceByKey performs the reduction within each partition before shuffling, which reduces the amount of data that has to be shuffled.
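To get a correct average out of reduceByKey, the usual pattern is to carry a (sum, count) accumulator and divide only at the end; the pairwise (a + b) / 2 in the snippet above is not associative, so its result depends on how the partitions happen to be combined. A minimal sketch of that pattern on plain Scala collections (the sample keys stand in for the question's findClosest result; in Spark the same merge function would be passed to reduceByKey):

```scala
// Sample keyed points standing in for the question's (cluster, vector) pairs.
val points: Seq[(Int, (Int, Int))] =
  Seq(0 -> (1, 2), 0 -> (3, 4), 1 -> (10, 20))

// Each point becomes a partial accumulator (sumX, sumY, count).
val partials = points.map { case (k, (x, y)) => k -> (x.toLong, y.toLong, 1L) }

// The merge is associative and commutative, so Spark could apply it
// within each partition before the shuffle (map-side combine).
def merge(a: (Long, Long, Long), b: (Long, Long, Long)): (Long, Long, Long) =
  (a._1 + b._1, a._2 + b._2, a._3 + b._3)

// Group, merge the accumulators, and divide only once at the end.
val averages = partials
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2).reduce(merge) }
  .map { case (k, (sx, sy, n)) => k -> ((sx / n).toInt, (sy / n).toInt) }
// averages == Map(0 -> (2, 3), 1 -> (10, 20))
```

Because the merge only produces one small accumulator per key per partition, the shuffled data shrinks from all points to one tuple per key per partition.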

But according to the output of dependencies, groupByKey has a OneToOneDependency, which does not involve a shuffle, while reduceByKey has a ShuffleDependency, which does. Why?
The OneToOneDependency corresponds to the mapValues call, not the groupByKey call. If you remove it, you should see a ShuffleDependency instead. Also note the ShuffledRDD in the groupByKey lineage.
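This can be checked directly from the question's own code: dependencies only reports the immediate parent of the last RDD in the chain, so splitting the groupByKey off from the mapValues exposes the ShuffleDependency. A sketch reusing the question's vectors, findClosest, means and averageVectors:

```scala
// groupByKey produces a ShuffledRDD, whose only dependency is the shuffle.
val grouped = vectors.
  map(v => findClosest(v, means) -> v).
  groupByKey

println(grouped.dependencies)
// should print something like List(org.apache.spark.ShuffleDependency@...)

// mapValues wraps it in a MapPartitionsRDD, whose immediate dependency
// on `grouped` is one-to-one; the shuffle is one level further up.
val indexedMeansG = grouped.mapValues(averageVectors)
println(indexedMeansG.dependencies)
// should print something like List(org.apache.spark.OneToOneDependency@...)
```

In other words, both versions shuffle; reduceByKey only looks different in the dependencies output because its ShuffledRDD happens to be the last RDD in the chain.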