Spark DataFrame Partitioning


Currently I have one DataFrame, and I want to split it into several independent DataFrames and then process them one by one.

In Spark, the data looks like this:

+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+

I then want to split it into four DataFrames, one per (id, data_identifier_method) combination. For example, the first one would be:

+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+

How can I do this?

In other words, if I don't split the original DataFrame, how can I operate on these four groups within it?

"I want to split them into several independent DataFrames and then process them one by one."

What problem are you actually trying to solve? Depending on your requirements, there are many ways to accomplish something like this, each with very different memory and performance characteristics:

1. Use table partitioning. All output files end up under the same folder path, but technically each partition can get multiple files. You can optionally sort by the partition column to minimize the number of output files (see the sketch after this list):

    dataframe.write.partitionBy("partition_col").parquet("s3://bucket/path/")

2. Use hash repartitioning to pull all matching records into the same partition, though note that multiple keys may still end up in the same partition:

    dataframe.repartition("partition_col")

3. Collect the partition-column values, then run an independent job for each value. This lets you gather all of a key's records into a single partition, but it is much slower than the first two options, requires re-reading many records, and crams each key's records into a single partition, which can cause performance and memory problems:

    import org.apache.spark.sql.functions.{col, lit}

    val partitionColVals = dataframe.select("partition_col").distinct().collect().map(_.get(0))
    for (partitionColVal <- partitionColVals) {
      dataframe.where(col("partition_col") === lit(partitionColVal)).repartition(1)...
    }

4. Window functions - an SQL concept that lets you perform operations on sets of matching keys, somewhat like a high-powered GROUP BY. Whether this fits depends on what you are trying to do. One window function that suits this case is NTILE.
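As referenced in item 1, here is a minimal, self-contained sketch of a partitioned write. The data, column choices, and output path are illustrative, not taken from the question; sorting by the partition columns first is the optional step that tends to reduce the file count per partition directory:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioned-write").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data with the same shape as the question: two ids x two methods.
    val df = Seq(
      ("fd78sfsdfsd8vs", "algid1", 342342.12),
      ("fd78sfsdfsd8vs", "algid2", 342421.88),
      ("fd784213423f",   "algid1", 351232.92),
      ("fd784213423f",   "algid2", 351232.92)
    ).toDF("id", "data_identifier_method", "value")

    // One sub-folder per (id, data_identifier_method) combination; sorting
    // first co-locates each key's rows so fewer files land in each folder.
    df.sort("id", "data_identifier_method")
      .write
      .partitionBy("id", "data_identifier_method")
      .parquet("/tmp/partitioned_output")  // e.g. s3://bucket/path/ in production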

NTILE is a Spark SQL analytic function. It divides an ordered dataset into the number of buckets given by expr and assigns the appropriate bucket number to each row. Bucket numbers run from 1 to expr, and expr must resolve to a positive constant for each partition.

Every row then carries a bucket number (its NTILE value), and you can use a filter on that number to pull out each of the datasets the NTILE call defined.

Here is the pseudocode:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.ntile
    val w = Window.orderBy("sum_val")  // x is a DataFrame with columns "id" and "sum_val"
    val resultDF = x.orderBy("sum_val").select(x("id"), x("sum_val"), ntile(4).over(w).alias("ntile"))
    
    +---+-------+-----+
    | id|sum_val|ntile|
    +---+-------+-----+
    |  B|      3|    1|
    |  E|      4|    1|
    |  H|      4|    2|
    |  D|     14|    2|
    |  A|     14|    3|
    |  F|     30|    3|
    |  C|     34|    4|
    +---+-------+-----+
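
To materialize one of the buckets, filtering on the ntile column is enough. A minimal sketch, reusing resultDF from the snippet above:

    import org.apache.spark.sql.functions.col

    // Rows whose bucket number is 1; loop over 1 to 4 for the other buckets.
    val bucket1 = resultDF.where(col("ntile") === 1)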
    

Applied to your own data, the same filtering idea on the grouping columns yields exactly the sub-DataFrames you asked for, for example:
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
    |            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
    |  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
    |  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
    |  fd784213423f|  algid1_set1_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
    
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
    |            id|data_identifier_method|       start_time|         end_time|time_interval|             time|    value|
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
    |  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:00|342342.12|
    |  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:05|342421.88|
    |  fd784213423f|  algid2_set2_total...|20200903 00:00:00|20200903 00:00:10|            5|20200903 00:00:10|351232.92|
    +--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
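
A sketch of how those frames could be produced with the collect-and-filter approach from item 3, applied to the question's columns. Since the data_identifier_method values are truncated in the listings above, the loop compares against the collected values rather than hard-coded strings:

    import org.apache.spark.sql.functions.col

    // Collect the distinct (id, data_identifier_method) pairs -- four here --
    // then filter the original DataFrame once per pair.
    val keys = dataframe.select("id", "data_identifier_method").distinct().collect()

    for (row <- keys) {
      val sub = dataframe.filter(
        col("id") === row.getString(0) &&
        col("data_identifier_method") === row.getString(1))
      // process `sub` here, e.g. sub.show()
    }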
    