Apache Spark coalesce performance


In my application I see a significant performance difference depending on the shuffle parameter passed to coalesce:

val subscriber = a.join(b).map(x => (x._1, x._2))
// subscriber has more than 40000 partitions
subscriber.coalesce(5, true).saveAsTextFile("result")  // performance good
subscriber.coalesce(5, false).saveAsTextFile("result") // performance poor

According to the Spark documentation for coalesce:

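However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).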

But I don't understand why leaving shuffle set to false would cause the computation to run on fewer nodes. Can someone explain?
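
The behaviour described in the documentation passage can be seen in a small local sketch. Everything in it is an assumption for illustration: the `CoalesceSketch` object, the `variants` helper, the `local[4]` master, and the toy inputs joined into 400 partitions, which stand in for the real `a`, `b`, and the 40000-partition join.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch, not the asker's real job.
object CoalesceSketch {
  type Row = (Int, (Int, Int))

  // Builds the same pipeline twice: once with a narrow coalesce,
  // once with a coalesce that adds a shuffle step.
  def variants(sc: SparkContext): (RDD[Row], RDD[Row]) = {
    val a = sc.parallelize(1 to 10000, 8).map(i => (i, i))
    val b = sc.parallelize(1 to 10000, 8).map(i => (i, i * 2))

    // Stand-in for the many-partition join result in the question.
    val subscriber = a.join(b, 400).map(x => (x._1, x._2))

    // shuffle = false: a narrow dependency, so the join's reduce side
    // and the map are pulled into the same 5 tasks as the final write.
    val narrow = subscriber.coalesce(5, shuffle = false)

    // shuffle = true: a stage boundary is inserted, so the join's reduce
    // side and the map still run as 400 tasks before the data is
    // redistributed into 5 partitions.
    val shuffled = subscriber.coalesce(5, shuffle = true)

    (narrow, shuffled)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (narrow, shuffled) = variants(sc)
      // Both end up with 5 partitions; the lineage shows where they differ.
      println(narrow.getNumPartitions)
      println(shuffled.getNumPartitions)
      println(narrow.toDebugString)
      println(shuffled.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

In the lineage printed by `toDebugString`, the `shuffle = false` variant has no extra stage between the join and the coalesce, so everything after the join's shuffle runs in only 5 tasks. With `shuffle = true` the upstream work keeps its original parallelism and only the final write uses 5 tasks, which is likely why the first call in the question is faster.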

Same problem here. It was fine on Spark 1.6.1, but coalesce is very slow on Spark 2.0.2.