
PySpark: Spark query result is affected by the shuffle partition count


I have the following code that groups rows by salary:

from pyspark.sql import SparkSession

# `spark` is predefined in the pyspark shell/notebook; create it when running as a script
spark = SparkSession.builder.appName("shuffle-partitions-sample").getOrCreate()

# this is a sample to learn about the shuffle partitions config property
def getDataFrame():
    data = [('Eric', 'history', 4000), ('Adam', 'Economics', 3000), ('Angela', 'Science', 6000)]
    dataDF = spark.createDataFrame(data, 'name STRING, dept STRING, salary INT')

    # orderBy and groupBy trigger a shuffle, which creates shuffle partitions
    groupedDF = dataDF.orderBy("salary", ascending=True).groupBy("salary").count()
    groupedDF.show()
    print("Number of partitions: ", groupedDF.rdd.getNumPartitions())


spark.conf.set("spark.sql.shuffle.partitions", 200)
getDataFrame()
spark.conf.set("spark.sql.shuffle.partitions", 80)
getDataFrame()

I expected the result to be the same, since the data and the aggregate function are identical, but the result (the row order in the output) seems to be affected by the number of shuffle partitions.


Can someone explain this behavior?

In addition to the groupBy(), you can apply an orderBy() after the aggregation to get the same result even after changing the shuffle partitions setting.
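
Applied to the question's code, a minimal sketch of that fix could look like this (assuming the dataDF built in the question):

# Sketch: sort after the aggregation so the displayed order
# no longer depends on spark.sql.shuffle.partitions.
groupedDF = dataDF.groupBy("salary").count().orderBy("salary", ascending=True)
groupedDF.show()  # same row order whether shuffle partitions is 200 or 80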

A groupBy on a DataFrame produces a RelationalGroupedDataset, which, like a DataFrame, is not expected to preserve the order of its elements.
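
To confirm that only the row order changes while the aggregated contents stay the same, one way (an illustrative sketch, not part of the original answer; it assumes the dataDF from the question) is to compare the collected rows order-independently:

# Illustrative sketch: collect the grouped rows under both settings
# and compare them as sets, ignoring row order.
spark.conf.set("spark.sql.shuffle.partitions", 200)
rows_200 = set(dataDF.groupBy("salary").count().collect())
spark.conf.set("spark.sql.shuffle.partitions", 80)
rows_80 = set(dataDF.groupBy("salary").count().collect())
print(rows_200 == rows_80)  # True: identical contents, only the order differs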