PySpark: number of actual partitions vs. shuffle partitions for a groupBy on a CSV DataFrame

I have a MovieLens CSV dataset file with the columns "movieID", "UserID", "Rating" and "Timestamp". I am aggregating the ratings for each movie into a count and an average. Below is my code:

from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Explicit schema for the ratings CSV
schema = StructType([
                    StructField('UserID', StringType(), True),
                    StructField('MovieID', StringType(), True),
                    StructField('Rating', FloatType(), True),
                    StructField('Timestamp', StringType(), True)
                    ])

movie_df = spark.read.csv('../resources/ml-latest-small/ratings.csv', schema=schema, header='true') \
    .select('MovieID', 'Rating')

movie_df.createOrReplaceTempView('movie_tbl')

popular_df = spark.sql("""
                        SELECT MovieID, count(*) AS rating_count, avg(Rating) AS avg_rating
                        FROM movie_tbl
                        GROUP BY MovieID
                        ORDER BY count(*) DESC """)

popular_df.write.csv(path='output/avg_ratings', mode='append', header='True')

By default the number of Spark shuffle partitions is 200, but my popular_df ends up with only 46 partitions instead of 200. When I run explain(), I can see that both the rangepartitioning and hashpartitioning exchanges in the plan use 200 partitions:

>>> movie_df.rdd.getNumPartitions()
1
>>> 
>>> popular_df.rdd.getNumPartitions()
46                                                                              
>>> spark.conf.get('spark.sql.shuffle.partitions')
'200'
>>>
>>> popular_df.explain()
== Physical Plan ==
*(3) Sort [rating_count#10L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(rating_count#10L DESC NULLS LAST, 200), true, [id=#32]
   +- *(2) HashAggregate(keys=[MovieID#1], functions=[count(1), avg(cast(Rating#2 as double))])
      +- Exchange hashpartitioning(MovieID#1, 200), true, [id=#28]
         +- *(1) HashAggregate(keys=[MovieID#1], functions=[partial_count(1), partial_avg(cast(Rating#2 as double))])
            +- FileScan csv [MovieID#1,Rating#2] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<MovieID:string,Rating:float>
 
So how does Spark arrive at 46 partitions?
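
For reference, here is a small diagnostic sketch (using the popular_df defined above) to see how the rows are actually spread over those 46 partitions; it does not explain the number by itself, it just makes the distribution visible:

# Diagnostic sketch: glom() turns each partition into a list of rows,
# so mapping len() over it gives the row count per partition.
partition_sizes = popular_df.rdd.glom().map(len).collect()

print(len(partition_sizes))                        # total number of partitions (46 here)
print(sum(1 for s in partition_sizes if s == 0))   # how many partitions are empty
print(partition_sizes)                             # rows per partition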

Also, when I save the df with the DataFrameWriter, I can see 45 CSV data files (from part-00000-* to part-00044-*) plus one _SUCCESS file. If I change the shuffle partitions to 4, I get 3 CSV files plus one _SUCCESS file. So does Spark apply a coalesce or a repartition somewhere while building the DAG?
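
In case it is relevant, this is a sketch of how one could check whether adaptive query execution is coalescing the shuffle partitions at runtime (assuming Spark 3.x; the two configuration keys below are the standard AQE settings, not something set in the code above):

# Sketch, assuming Spark 3.x: when AQE is enabled it can merge the 200 shuffle
# partitions into fewer, larger partitions at runtime, which would also reduce
# the number of part files written out.
print(spark.conf.get('spark.sql.adaptive.enabled'))
print(spark.conf.get('spark.sql.adaptive.coalescePartitions.enabled'))

# With AQE active, the executed plan shows an AdaptiveSparkPlan node.
popular_df.explain()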