
Apache Spark: why does a Spark RDD keep more partitions for small data?


I create an RDD by passing a collection to sparkContext's parallelize method. My question is: why does it give me 8 partitions when I only have 3 records? Do I have empty partitions?

 scala> val rdd = sc.parallelize(List("surender","raja","kumar"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:40

scala> rdd.partitions.length
res0: Int = 8

scala> rdd.partitions
res1: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@691, 
org.apache.spark.rdd.ParallelCollectionPartition@692, 
org.apache.spark.rdd.ParallelCollectionPartition@693, 
org.apache.spark.rdd.ParallelCollectionPartition@694, 
org.apache.spark.rdd.ParallelCollectionPartition@695, 
org.apache.spark.rdd.ParallelCollectionPartition@696, 
org.apache.spark.rdd.ParallelCollectionPartition@697, 
org.apache.spark.rdd.ParallelCollectionPartition@698)

scala> rdd.getNumPartitions
res2: Int = 8
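
To see whether some of those 8 partitions are actually empty, one option (a sketch; the exact distribution can vary with your setup) is to group each partition's elements with glom and count them:

scala> rdd.glom().map(_.length).collect()   // number of elements in each partition

With 3 records spread over 8 partitions you would expect an array such as Array(0, 0, 1, 0, 0, 1, 0, 1), i.e. most of the partitions are indeed empty.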

If you don't provide the number of partitions, it creates as many partitions as defined by spark.default.parallelism; you can check that value by running sc.defaultParallelism.
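
As a rough sketch of how the property and parallelize interact (the app name and core counts below are illustrative, not from the question; in spark-shell the context already exists as sc):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("partition-demo")            // illustrative application name
  .setMaster("local[4]")                   // assume 4 local cores
  .set("spark.default.parallelism", "4")   // explicit default used by parallelize
val sc = new SparkContext(conf)

sc.defaultParallelism                                                   // 4 with the setting above
sc.parallelize(List("surender", "raja", "kumar")).partitions.length    // also 4, since no count is given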

Its value depends on where you run and on your hardware:

According to the documentation (look for spark.default.parallelism), it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

You can specify the number of partitions with the second argument of the parallelize method.

For example:

scala> val rdd = sc.parallelize(List("surender","raja","kumar"), 5)

scala> rdd.partitions.length
res1: Int = 5

scala> sc.defaultParallelism
res2: Int = 4

Ok. I can't see this configuration in Cloudera Manager. I checked the cores available on each node in the cluster; the nodes have 32 and 16 cores. I am running spark-shell from the edge node. So where is this 8 being set?

Did you run sc.defaultParallelism in the spark-shell? How many cores does your edge node have?

sc.defaultParallelism gives me 8. My question is: what if I want sc.defaultParallelism to return 4 or 6? My edge node has 32 cores.
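
If the goal is to have sc.defaultParallelism come back as 4 or 6, one way (a sketch; the value 6 is just an example) is to set the property when launching the shell, since it has to be in place before the SparkContext is created:

spark-shell --conf spark.default.parallelism=6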