PySpark: number of actual partitions vs. shuffle partitions for a groupBy on a CSV DataFrame

I have a MovieLens CSV dataset file with the columns "movieID", "UserID", "Rating" and "Timestamp". I am aggregating the ratings for each movie into a count and an average. Below is my code:

from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Explicit schema for the ratings CSV
schema = StructType([
                    StructField('UserID', StringType(), True),
                    StructField('MovieID', StringType(), True),
                    StructField('Rating', FloatType(), True),
                    StructField('Timestamp', StringType(), True)
                    ])

movie_df = spark.read.csv('../resources/ml-latest-small/ratings.csv', schema=schema, header='true') \
    .select('MovieID', 'Rating')

movie_df.createOrReplaceTempView('movie_tbl')

popular_df = spark.sql("""
                        SELECT MovieID, count(*) AS rating_count, avg(Rating) AS avg_rating
                        FROM movie_tbl
                        GROUP BY MovieID
                        ORDER BY count(*) DESC """)

popular_df.write.csv(path='output/avg_ratings', mode='append', header='True')

By default the number of Spark shuffle partitions is 200, but my popular_df ends up with only 46 partitions instead of 200. When I run explain(), I can see that both the rangepartitioning and hashpartitioning exchanges in the plan use 200 partitions:

>>> movie_df.rdd.getNumPartitions()
1
>>> 
>>> popular_df.rdd.getNumPartitions()
46                                                                              
>>> spark.conf.get('spark.sql.shuffle.partitions')
'200'
>>>
>>> popular_df.explain()
== Physical Plan ==
*(3) Sort [rating_count#10L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(rating_count#10L DESC NULLS LAST, 200), true, [id=#32]
   +- *(2) HashAggregate(keys=[MovieID#1], functions=[count(1), avg(cast(Rating#2 as double))])
      +- Exchange hashpartitioning(MovieID#1, 200), true, [id=#28]
         +- *(1) HashAggregate(keys=[MovieID#1], functions=[partial_count(1), partial_avg(cast(Rating#2 as double))])
            +- FileScan csv [MovieID#1,Rating#2] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<MovieID:string,Rating:float>
 
So how does Spark arrive at 46 partitions?
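
For reference, here is a small diagnostic sketch (using the popular_df defined above) to see how the rows are actually spread over those 46 partitions; it does not explain the number by itself, it just makes the distribution visible:

# Diagnostic sketch: glom() turns each partition into a list of rows,
# so mapping len() over it gives the row count per partition.
partition_sizes = popular_df.rdd.glom().map(len).collect()

print(len(partition_sizes))                        # total number of partitions (46 here)
print(sum(1 for s in partition_sizes if s == 0))   # how many partitions are empty
print(partition_sizes)                             # rows per partition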

Also, when I save the df with the DataFrameWriter, I can see 45 CSV data files (from part-00000-* to part-00044-*) plus one _SUCCESS file. If I change the shuffle partitions to 4, I get 3 CSV files plus one _SUCCESS file. So does Spark apply a coalesce or a repartition somewhere while building the DAG?
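
In case it is relevant, this is a sketch of how one could check whether adaptive query execution is coalescing the shuffle partitions at runtime (assuming Spark 3.x; the two configuration keys below are the standard AQE settings, not something set in the code above):

# Sketch, assuming Spark 3.x: when AQE is enabled it can merge the 200 shuffle
# partitions into fewer, larger partitions at runtime, which would also reduce
# the number of part files written out.
print(spark.conf.get('spark.sql.adaptive.enabled'))
print(spark.conf.get('spark.sql.adaptive.coalescePartitions.enabled'))

# With AQE active, the executed plan shows an AdaptiveSparkPlan node.
popular_df.explain()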