Apache Spark: why does a Spark RDD get more partitions than records for small data?

I create an RDD by passing a collection to the sparkContext.parallelize method. My question is: why does it give me 8 partitions when I only have 3 records? Do I have empty partitions?
scala> val rdd = sc.parallelize(List("surender","raja","kumar"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:40
scala> rdd.partitions.length
res0: Int = 8
scala> rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691,
org.apache.spark.rdd.ParallelCollectionPartition@692,
org.apache.spark.rdd.ParallelCollectionPartition@693,
org.apache.spark.rdd.ParallelCollectionPartition@694,
org.apache.spark.rdd.ParallelCollectionPartition@695,
org.apache.spark.rdd.ParallelCollectionPartition@696,
org.apache.spark.rdd.ParallelCollectionPartition@697,
org.apache.spark.rdd.ParallelCollectionPartition@698)
scala> rdd.getNumPartitions
res2: Int = 8
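To check whether some of those 8 partitions are actually empty, one option is to look at each partition's contents with glom (a quick sketch assuming the rdd defined above; with 3 records spread over 8 partitions, several of the counts will be 0):

scala> rdd.glom().map(_.length).collect()

glom() turns each partition into an array of its elements, so mapping to _.length yields the record count per partition; any 0 in the result marks an empty partition.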
If you do not provide a number of partitions, Spark creates as many partitions as defined by spark.default.parallelism. You can check its value by running sc.defaultParallelism.

Its value depends on where you run and on your hardware. According to the Spark configuration documentation (look for spark.default.parallelism), it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine-grained mode: 8
Others: total number of cores on all executor nodes, or 2, whichever is larger

You can specify the number of partitions with the second argument of the parallelize method.
For example:

scala> val rdd = sc.parallelize(List("surender", "raja", "kumar"), 5)
scala> rdd.partitions.length
res1: Int = 5
scala> sc.defaultParallelism
res2: Int = 4
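If the RDD already exists, you can also change its partition count afterwards. A quick sketch, using the rdd from above:

scala> rdd.repartition(4).partitions.length   // full shuffle into 4 partitions, returns 4
scala> rdd.coalesce(2).partitions.length      // reduce partitions, avoiding a full shuffle, returns 2

repartition can increase or decrease the count but always shuffles the data; coalesce can only reduce it, but is cheaper because it merges existing partitions.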
OK. I cannot see this configuration in Cloudera Manager. I checked the available cores on each node in the cluster: some nodes have 32 cores and some have 16. I am running spark-shell from an edge node. So where is this 8 being set?

Did you run sc.defaultParallelism in the spark-shell? How many cores does your edge node have?

sc.defaultParallelism gives me 8. My question is: what if I want to bring sc.defaultParallelism back to 4 or 6? My edge node has 32 cores.
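In reply to the last comment: one way to override the default (a sketch, not verified on this particular cluster) is to set spark.default.parallelism when launching the shell, or on the SparkConf before the context is created:

$ spark-shell --conf spark.default.parallelism=6

// or, when building your own context in application code:
val conf = new SparkConf().setAppName("example").set("spark.default.parallelism", "6")
val sc = new SparkContext(conf)

After that, sc.defaultParallelism should report 6 instead of the cluster-derived 8. Note that this only changes the default; an explicit second argument to parallelize still wins.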