Apache Spark: DataFrame partition count


apache-spark dataframe


I'm confused about how Spark creates partitions for a Spark DataFrame. Below are the steps I ran and the partition counts I got.

i_df = sqlContext.read.json("json files")  # num partitions returned is 4, total records 7000
p_df = sqlContext.read.format("csv")  # other options omitted; num partitions returned is 4, total records: 120k
j_df = i_df.join(p_df, i_df.productId == p_df.product_id)  # total records 7000, but num of partitions is 200

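The first two DataFrames have 4 partitions each, but as soon as I join them the result shows 200 partitions. I expected the joined DataFrame to have 4 partitions as well, so why does it show 200?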

I'm running it locally:

conf.setIfMissing("spark.master", "local[4]")

200 is the default number of shuffle partitions, which is the partition count a join gets when it shuffles its inputs. You can change it by setting

spark.sql.shuffle.partitions

Thanks for the answer. Do partitions need to be preserved during a join? I applied a filter to j_df that leaves only 2 rows, but it still shows 200 partitions. What does that mean?