Applying groupBy and orderBy on a DataFrame in Scala
I need to sort the values of one column in a dataframe and group by another column. The data in the dataframe looks like below:
+------------+---------+-----+
| NUM_ID| TIME |SIG_V|
+------------+---------+-----+
|XXXXX01 |167499000|55 |
|XXXXX02 |167499000| |
|XXXXX01 |167503000| |
|XXXXX02 |179810000| 81.0|
|XXXXX02 |179811000| 81.0|
|XXXXX01 |179833000| |
|XXXXX02 |179833000| |
|XXXXX02 |179841000| 81.0|
|XXXXX01 |179841000| |
|XXXXX02 |179842000| 81.0|
|XXXXX03 |179843000| 87.0|
|XXXXX02 |179849000| |
|XXXXX02 |179850000| |
|XXXXX01 |179850000| 88.0|
|XXXXX01 |179857000| |
|XXXXX01 |179858000| |
|XXXXX01 |179865000| |
|XXXXX03 |179865000| |
|XXXXX02 |179870000| |
|XXXXX02 |179871000| 11 |
+------------+---------+-----+
The above data is sorted by the TIME column.
My requirement is to group the NUM_ID column, as shown below:
+------------+---------+-----+
| NUM_ID| TIME |SIG_V|
+------------+---------+-----+
|XXXXX01 |167499000|55 |
|XXXXX01 |167503000| |
|XXXXX01 |179833000| |
|XXXXX01 |179841000| |
|XXXXX01 |179850000| 88.0|
|XXXXX01 |179857000| |
|XXXXX01 |179858000| |
|XXXXX01 |179865000| |
|XXXXX02 |167499000| |
|XXXXX02 |179810000| 81.0|
|XXXXX02 |179811000| 81.0|
|XXXXX02 |179833000| |
|XXXXX02 |179841000| 81.0|
|XXXXX02 |179842000| 81.0|
|XXXXX02 |179849000| |
|XXXXX02 |179850000| |
|XXXXX02 |179870000| |
|XXXXX02 |179871000| 11 |
|XXXXX03 |179843000| 87.0|
|XXXXX03 |179865000| |
+------------+---------+-----+
Column NUM_ID is now grouped, and column TIME is sorted within each NUM_ID.
I tried applying groupBy and orderBy on the dataframe, which did not work:
val df2 = df1.withColumn("SIG_V", col("SIG")).orderBy("TIME").groupBy("NUM_ID")
and got the following error on df2.show:
error: value orderBy is not a member of org.apache.spark.sql.RelationalGroupedDataset
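The error occurs because groupBy on a DataFrame returns a RelationalGroupedDataset, which only exposes aggregation methods (agg, count, max, and so on), not orderBy. A loose analogy with plain Scala collections (illustration only, no Spark involved; the data values are made up from the tables above): groupBy yields a Map of groups rather than a reorderable sequence, so there is no row order left to sort.

```scala
// Plain-Scala analogy (not Spark): groupBy turns a sequence of rows into
// a Map keyed by the grouping value. Like Spark's RelationalGroupedDataset,
// the result is meant for per-group computation, not for ordering rows.
val rows = Seq(("XXXXX01", 167499000L), ("XXXXX02", 167499000L), ("XXXXX01", 167503000L))
val grouped: Map[String, Seq[(String, Long)]] = rows.groupBy(_._1)
println(grouped("XXXXX01").map(_._2))  // the TIME values of the XXXXX01 group
```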
Is there any clue to achieve this requirement?

You don't need groupBy; just put both columns into orderBy:
scala> df.show()
+---+---+
| _1| _2|
+---+---+
| 1| 3|
| 2| 2|
| 1| 4|
| 1| 1|
| 2| 0|
| 1| 10|
| 2| 5|
+---+---+
scala> df.orderBy('_1,'_2).show()
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 1| 3|
| 1| 4|
| 1| 10|
| 2| 0|
| 2| 2|
| 2| 5|
+---+---+
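The same two-key ordering can be mirrored with plain Scala collections, without Spark: sorting by a tuple key compares the first column and breaks ties with the second, which is exactly the grouped-then-ordered layout the question asks for.

```scala
// Plain-Scala sketch of the two-key sort performed by orderBy('_1, '_2).
// Tuple ordering compares _1 first and uses _2 as a tie-breaker, so rows
// come out grouped by the first column and ordered by the second within
// each group. Data matches the answer's demo dataframe.
val rows = Seq((1, 3), (2, 2), (1, 4), (1, 1), (2, 0), (1, 10), (2, 5))
val sorted = rows.sortBy { case (c1, c2) => (c1, c2) }
println(sorted)  // List((1,1), (1,3), (1,4), (1,10), (2,0), (2,2), (2,5))
```

For the question's own dataframe, the Spark equivalent would presumably be df1.orderBy(col("NUM_ID"), col("TIME")) (a sketch; the column names are taken from the tables above).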