Apache Spark: find the partition number/ID


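Is there a way to find the partition ID/number in Spark?

For example:

val input1 = sc.parallelize(List(8, 9, 10), 3)

val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}

I would like to put some code in place of ??? that prints the partition ID/number.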

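Actually, mapPartitionsWithIndex gives you an iterator together with the partition index. (This is not the same as reduce, of course, but you can combine its result with aggregate.)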
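
As a rough illustration of combining mapPartitionsWithIndex with aggregate, here is a minimal sketch applied to the RDD from the question. The object name ReduceWithPartitionIndex, the run() helper and the local[3] master are assumptions made only so the example is self-contained; in spark-shell you would use the existing sc instead.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch runnable on its own.
object ReduceWithPartitionIndex {
  def run(): Int = {
    val spark = SparkSession.builder()
      .master("local[3]") // assumption: local mode, so println output shows up in the console
      .appName("reduce-with-partition-index-sketch")
      .getOrCreate()
    try {
      val input1 = spark.sparkContext.parallelize(List(8, 9, 10), 3)

      // Tag every element with the index of the partition that holds it.
      val tagged = input1.mapPartitionsWithIndex { (index, itr) =>
        itr.map(x => (index, x))
      }

      tagged.aggregate(0)(
        // seqOp runs inside a task, once per element, so the index is available here.
        (acc, pair) => {
          println(s"Inside partition ${pair._1}: adding ${pair._2}")
          acc + pair._2
        },
        // combOp merges the per-partition partial sums.
        (a, b) => a + b
      )
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run()) // sum of 8, 9 and 10
}
```

On a real cluster the println output goes to the executors' stdout rather than to the driver console.
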
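Posting the answer here, using mapPartitionsWithIndex as suggested by @Holden.

I have created an RDD (input) with 3 partitions. Each element of input is tagged with its partition index (index) in the call to mapPartitionsWithIndex: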

scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)

You can also use TaskContext.getPartitionId(), e.g. in place of the currently missing foreachPartitionWithIndex().
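
A minimal sketch of how TaskContext.getPartitionId() might be used, both for a side effect in foreachPartition and to tag elements in mapPartitions. The object name PartitionIdViaTaskContext, the run() helper and the local[3] master are assumptions for illustration only.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch runnable on its own.
object PartitionIdViaTaskContext {
  def run(): Array[(Int, Int)] = {
    val spark = SparkSession.builder()
      .master("local[3]") // assumption: local mode for a quick experiment
      .appName("task-context-partition-id-sketch")
      .getOrCreate()
    try {
      val input = spark.sparkContext.parallelize(11 to 17, 3)

      // Stand-in for a foreachPartitionWithIndex: ask the running task for its partition id.
      input.foreachPartition { itr =>
        val pid = TaskContext.getPartitionId()
        println(s"Partition $pid holds ${itr.mkString(", ")}")
      }

      // The same call inside mapPartitions tags each element with its partition id.
      input.mapPartitions { itr =>
        val pid = TaskContext.getPartitionId()
        itr.map(x => (pid, x))
      }.collect()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The call is only meaningful inside a running task. As far as I know, getPartitionId() returns 0 when there is no active TaskContext, so calling it from the function passed to reduce can be misleading: the final merge of per-partition results happens on the driver.
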
I came across this old question while looking for the spark_partition_id SQL function for DataFrames:
import org.apache.spark.sql.functions.spark_partition_id

val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id", spark_partition_id()).rdd.collect

res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
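
Building on the same function, one possible use is counting rows per partition by grouping on spark_partition_id(). This is only a sketch: the object name RowsPerPartition, the run() helper, the column names and the local[3] master are assumptions and do not come from the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Hypothetical wrapper object, used only to make the sketch runnable on its own.
object RowsPerPartition {
  def run(): Map[Int, Long] = {
    val spark = SparkSession.builder()
      .master("local[3]") // assumption: local mode for a quick experiment
      .appName("rows-per-partition-sketch")
      .getOrCreate()
    import spark.implicits._ // needed for toDF outside spark-shell
    try {
      val df = spark.sparkContext.parallelize(11 to 17, 3).toDF("value")

      // Group by the id of the partition each row currently lives in and count the rows.
      df.groupBy(spark_partition_id().as("partition_id"))
        .count()
        .collect()
        .map(row => row.getInt(0) -> row.getLong(1))
        .toMap
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = run().toSeq.sorted.foreach(println)
}
```
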

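Hi @Holden, mapPartitionsWithIndex() actually creates a new RDD. What is the specific purpose of mapPartitions() and mapPartitionsWithIndex()? Is there any particular use case?

By using mapPartitionsWithIndex you can output new elements that carry their partition index; then, when you reduce, you know which partition the element you are processing came from. Is there something else you want to do?

This answers my question exactly. It took me some time to understand. I have posted the solution separately. It is good to learn from the experts :)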