What does the number after an Apache Spark RDD mean?



What does the number in square brackets after an RDD mean?

The number after the RDD is its identifier:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val rdd = sc.range(0, 42)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[1] at range at <console>:24

scala> rdd.id
res0: Int = 1
The number is generated very simply (nextRddId is just an AtomicInteger):

private[spark] def newRddId(): Int = nextRddId.getAndIncrement()

and each RDD picks up the next value in its constructor:

/** A unique ID for this RDD (within its SparkContext). */
val id: Int = sc.newRddId()

So if we follow with:

scala> val pairs1 = sc.parallelize(Seq((1, "foo")))
pairs1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> val pairs2 = sc.parallelize(Seq((1, "bar")))
pairs2: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> pairs1.id
res5: Int = 2

scala> pairs2.id
res6: Int = 3

scala> pairs1.join(pairs2).foreach(_ => ())

you would expect to see 4 next, which can be confirmed by inspecting the UI (the original answer included a screenshot of the Spark UI here). We can also see that join creates some new RDDs under the hood (5 and 6).
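The id-assignment scheme described above can be mimicked in plain Scala without a Spark installation. This is only a minimal sketch: MiniContext and MiniRDD are hypothetical stand-ins for SparkContext and RDD, not Spark code, but the counter mechanics (an AtomicInteger incremented once per constructed object) are the same.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Stand-in for SparkContext: hands out monotonically increasing ids,
// the same way SparkContext.newRddId() wraps an AtomicInteger.
class MiniContext {
  private val nextId = new AtomicInteger(0)
  def newId(): Int = nextId.getAndIncrement()
}

// Stand-in for RDD: grabs its id exactly once, at construction time.
class MiniRDD(ctx: MiniContext) {
  val id: Int = ctx.newId()
  // A transformation builds a *new* MiniRDD, which claims the next id,
  // mirroring how Spark transformations create new RDDs under the hood.
  def transform(): MiniRDD = new MiniRDD(ctx)
  override def toString: String = s"MiniRDD[$id]"
}

object Demo extends App {
  val ctx = new MiniContext
  val a = new MiniRDD(ctx)          // MiniRDD[0]
  val b = new MiniRDD(ctx)          // MiniRDD[1]
  val c = a.transform()             // MiniRDD[2] - a keeps id 0
  println(s"$a $b $c")
}
```

Note that the id belongs to the object, not the variable: reusing a transformed value never changes the original's id, which is why the REPL output above shows stable numbers like ParallelCollectionRDD[2] even after further operations.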