Apache Spark: why does the number of partitions change after joining two DStreams?


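Here, my stream_1 and dstream_2 each have 128 partitions, but when I join them the result drops to 3 partitions. Why does this happen? As far as I know, a join is performed partition by partition, i.e. partition 1 is joined with partition 1 of the other RDD. All of the filtered RDDs have 3 partitions, which is why historyRdd and historyRdd_2 have 3 partitions.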

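Partitioning in Spark depends on the operation you perform. For example, groupByKey() keeps the number of partitions of its parent RDD (if there is one). In contrast, some operations, such as join(), add up the partition counts of RDD1 and RDD2. Suppose RDD1 has 2 partitions and RDD2 has 3 partitions; the result of join() will then have 5.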

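Then can you explain why the partitions are decreasing? I am joining HistoryStream_1 (128 partitions) with HistoryStream_2 (128 partitions), so by your logic the result should have 256 partitions, but it has 3.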
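The disagreement above about partition counts can be checked locally. As far as I know, in the RDD API it is union() that adds the partition counts of its inputs, while join() without an explicit partitioner sizes its result through Partitioner.defaultPartitioner: an existing partitioner if one of the inputs has one, else spark.default.parallelism if it is set, else the largest input partition count. The following is a minimal sketch, assuming a local master; the object name and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; not part of the original post.
object PartitionCountCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: local master, spark.default.parallelism left unset.
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionCountCheck")
    val sc = new SparkContext(conf)

    // Two pair RDDs with 2 and 3 partitions and no partitioner.
    val rdd1 = sc.parallelize(Seq("a" -> 1, "b" -> 2), 2)
    val rdd2 = sc.parallelize(Seq("b" -> 20, "c" -> 30), 3)

    // union() concatenates the partitions of both inputs.
    println(s"union: ${rdd1.union(rdd2).partitions.length} partitions")

    // join() shuffles both sides with the default partitioner.
    println(s"join:  ${rdd1.join(rdd2).partitions.length} partitions")

    // An explicit partition count overrides the default.
    println(s"join(…, 8): ${rdd1.join(rdd2, 8).partitions.length} partitions")

    sc.stop()
  }
}
```

If the description above is right, the union line should report the sum of the two inputs and the plain join line should not.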

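import java.util.ArrayList

import kafka.serializer.StringDecoder
import org.apache.avro.generic.GenericData
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// The enclosing object and main method are not shown in the original post;
// the name SparkJob is assumed from the application name below.
// GenericDataRecordDecoder, kafkaParams and the input topics are defined elsewhere by the asker.
object SparkJob {
  def main(args: Array[String]): Unit = {

val sparkConf = new SparkConf().setMaster("yarn-cluster")
                               .setAppName("SparkJob")
                               .set("spark.executor.memory","2G")
                               .set("spark.dynamicAllocation.executorIdleTimeout","5")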


val streamingContext = new StreamingContext(sparkConf, Minutes(1))

var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD

var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD


val stream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_1))
val stream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_2))

// dstream_1 (used below) is presumably derived from stream_1 by a similar mapping; the original post does not show it.
val dstream_2 = stream_2.map((r: Tuple2[String, GenericData.Record]) =>
{
  //some mapping
})

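// Prepend the history kept from earlier batches to each new batch.
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
// Note: repartition returns a new RDD and the result is discarded here,
// so this line does not change the partitioning of dstream_2.
dstream_2.foreachRDD(r => r.repartition(500))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
// No partition count is passed to the join, so Spark chooses one.
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)

// Keys that are present only on the right-hand side.
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)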


filtered.foreachRDD{rdd =>

  val formatted = rdd.map(r  => (r._1 , r._2._2.get)) 

  historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
  historyRdd_2 = formatted // assign the new history
  historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}


val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)


filteredStream.foreachRDD{rdd =>
  val formatted = rdd.map(r => (r._1 , r._2._1.get)) 
  historyRdd.unpersist(false) // unpersist the 'old' history RDD
  historyRdd = formatted // assign the new history
  historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
 }
}
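One possible explanation for the fixed count of 3: as far as I can tell from PairDStreamFunctions, a DStream join called without a partition count uses a HashPartitioner sized by sparkContext.defaultParallelism, not by the partition counts of its inputs. fullOuterJoin also has an overload that takes numPartitions. The sketch below compares the two on queue-backed test streams; the object name, the data and the count of 8 are assumptions for illustration, not taken from the job above.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical object name; not part of the original post.
object DStreamJoinPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamJoinPartitions")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext
    println(s"defaultParallelism = ${sc.defaultParallelism}")

    // Two input RDDs with 8 partitions each, fed through queue-backed streams.
    val left: RDD[(String, Int)] = sc.parallelize(Seq("a" -> 1, "b" -> 2), 8)
    val right: RDD[(String, Int)] = sc.parallelize(Seq("b" -> 20, "c" -> 30), 8)
    val leftStream = ssc.queueStream(mutable.Queue(left))
    val rightStream = ssc.queueStream(mutable.Queue(right))

    // Join without a partition count: Spark picks the number of partitions.
    leftStream.fullOuterJoin(rightStream).foreachRDD { rdd =>
      println(s"default join:  ${rdd.partitions.length} partitions")
    }

    // Join with an explicit partition count.
    leftStream.fullOuterJoin(rightStream, 8).foreachRDD { rdd =>
      println(s"explicit join: ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run a few batches, then stop
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

If this reading is right, passing a count to fullOuterJoin in the job above, or setting spark.default.parallelism, would control the number of partitions after the join.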