PySpark: Spark query results affected by the shuffle partition count
I have the following code, which groups rows by salary:
# this is a sample to learn about the shuffle partitions config property
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def getDataFrame():
    data = [('Eric', 'history', 4000), ('Adam', 'Economics', 3000), ('Angela', 'Science', 6000)]
    dataDF = spark.createDataFrame(data, 'name STRING, dept STRING, salary INT')
    # orderBy and groupBy each trigger a shuffle, so the result is written into shuffle partitions
    groupedDF = dataDF.orderBy("salary", ascending=True).groupBy("salary").count()
    groupedDF.show()
    print("Number of partitions: ", groupedDF.rdd.getNumPartitions())

spark.conf.set("spark.sql.shuffle.partitions", 200)
getDataFrame()

spark.conf.set("spark.sql.shuffle.partitions", 80)
getDataFrame()
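As a side check, the aggregated rows themselves do not depend on this setting; only their order can. A minimal sketch of such a check, assuming the same spark session as above (getGroupedDF is a hypothetical variant of getDataFrame() that returns the grouped DataFrame instead of showing it):

def getGroupedDF():
    # hypothetical variant that returns the result instead of printing it
    data = [('Eric', 'history', 4000), ('Adam', 'Economics', 3000), ('Angela', 'Science', 6000)]
    dataDF = spark.createDataFrame(data, 'name STRING, dept STRING, salary INT')
    return dataDF.groupBy("salary").count()

spark.conf.set("spark.sql.shuffle.partitions", 200)
rows_200 = getGroupedDF().collect()

spark.conf.set("spark.sql.shuffle.partitions", 80)
rows_80 = getGroupedDF().collect()

# Same set of rows under both settings; only the order may differ.
print(sorted(rows_200) == sorted(rows_80))  # expected: True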
I expected the results to be the same, since the data and the aggregation function are identical, but the output seems to be affected by the shuffle partition count. Can someone explain this behavior?
In addition to the groupBy(), apply orderBy() after the aggregation; then you get the same result even after updating the shuffle partitions. groupBy on a DataFrame produces a RelationalGroupedDataset which, just like a DataFrame, is not supposed to preserve the order of its elements. The per-group counts are identical in both runs; only the order in which show() prints the rows changes, because the groups are hash-distributed across the configured number of shuffle partitions.
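A minimal sketch of that fix, reusing the question's own data and spark session: the sort is moved to after the aggregation, so the displayed order no longer depends on how the groups landed in shuffle partitions.

def getDataFrame():
    data = [('Eric', 'history', 4000), ('Adam', 'Economics', 3000), ('Angela', 'Science', 6000)]
    dataDF = spark.createDataFrame(data, 'name STRING, dept STRING, salary INT')
    # orderBy AFTER groupBy: the final sort fixes the output order
    groupedDF = dataDF.groupBy("salary").count().orderBy("salary", ascending=True)
    groupedDF.show()

spark.conf.set("spark.sql.shuffle.partitions", 200)
getDataFrame()   # same rows, same order

spark.conf.set("spark.sql.shuffle.partitions", 80)
getDataFrame()   # same rows, same order

Note that even with the sort, groupedDF.rdd.getNumPartitions() can still report different values for the two settings; what becomes deterministic is the row order, not the physical partitioning.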