Spark: partitioning a DataFrame into several independent DataFrames
Currently I have a DataFrame. I want to split it into several independent DataFrames and then process them sequentially in Spark, like this:
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
I then want to split it into four DataFrames, one per (id, data_identifier_method) combination, for example:
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
How can I do this?
In other words, if you don't split the original DataFrame, how would you operate on those four groups within it? "I want to split them into several independent DataFrames and then process them sequentially."
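If the per-group processing is an aggregation, it can often be expressed directly on the original DataFrame without any splitting. A minimal sketch (the avg over value is only a placeholder for whatever per-group processing is actually intended):

import org.apache.spark.sql.functions.avg

// Aggregate per (id, data_identifier_method) group in place,
// instead of materializing four separate DataFrames.
val perGroup = dataframe.groupBy("id", "data_identifier_method")
  .agg(avg("value").as("avg_value"))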
What problem are you actually trying to solve? Depending on your requirements, there are many ways to accomplish something like this, each with very different memory and performance characteristics.
You can use NTILE.
NTILE is a Spark SQL analytic function. It divides an ordered data set into the number of buckets indicated by expr and assigns the appropriate bucket number to each row. The bucket numbers range from 1 to expr. The expr value must resolve to a positive constant for each partition.
Each row then carries a bucket number (the NTILE value). You can then use a filter to select the rows of any bucket number produced by the NTILE function.
Here is the pseudocode:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select(x("id"), x("sum_val"), ntile(4).over(w).as("ntile"))
+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
| B| 3| 1|
| E| 4| 1|
| H| 4| 2|
| D| 14| 2|
| A| 14| 3|
| F| 30| 3|
| C| 34| 4|
+---+-------+-----+
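To process one bucket at a time, you could then filter on the ntile column of the result above (a minimal sketch; bucket 1 is arbitrary):

import org.apache.spark.sql.functions.col

// Select the rows that NTILE placed in bucket 1.
val bucket1 = resultDF.where(col("ntile") === 1)
bucket1.show()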
How about using a filter?
How, exactly? I suggest reading one of the many existing tutorials on basic DataFrame operations. If you still run into problems, come back here and show what you have tried and what did not work as expected.
Filtering on id and data_identifier_method yields the individual subsets, for example:
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid1_set1_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| id|data_identifier_method| start_time| end_time|time_interval| time| value|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:00|342342.12|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:05|342421.88|
| fd784213423f| algid2_set2_total...|20200903 00:00:00|20200903 00:00:10| 5|20200903 00:00:10|351232.92|
+--------------+----------------------+-----------------+-----------------+-------------+-----------------+---------+
// Collect the distinct values of the partition column, then build one
// filtered (and optionally repartitioned) DataFrame per value.
import org.apache.spark.sql.functions.{col, lit}

val partitionColVals = dataframe.select("partition_col").distinct.collect().map(_.get(0))
for (partitionColVal <- partitionColVals) {
  dataframe.where(col("partition_col") === lit(partitionColVal)).repartition(1)...
}
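Applied to the question's two key columns, the same pattern might look like this (a sketch; the subsets collection and the sequential show() calls are illustrative):

import org.apache.spark.sql.functions.col

// One DataFrame per distinct (id, data_identifier_method) pair --
// four subsets for the example data.
val keys = dataframe.select("id", "data_identifier_method").distinct.collect()
val subsets = keys.map { row =>
  dataframe.where(col("id") === row.getString(0) &&
    col("data_identifier_method") === row.getString(1))
}
subsets.foreach(_.show()) // process each subset sequentially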
Window functions: this is a SQL concept that lets you perform operations on sets of matching keys, somewhat like a high-powered group by. Whether it helps depends on what you are trying to do, as sketched below.
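For the question's data, a window partitioned by the key columns lets you compute per-group results without materializing separate DataFrames. A minimal sketch (the derived column group_avg_value is only illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Per-group average computed in place over each
// (id, data_identifier_method) partition.
val byKey = Window.partitionBy("id", "data_identifier_method")
val withGroupAvg = dataframe.withColumn("group_avg_value", avg(col("value")).over(byKey))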