Scala Spark - after withColumn("newCol", collect_list(...)), select the rows containing more than one element


I'm working with a DataFrame created from this JSON:

{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
Initially, the DataFrame looks like this:

+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 25|1205|  mendy|
| 24|1206|    rob|
| 23|1207| prudvi|
+---+----+-------+
What I need is to group together all students with the same age, sorted by their id. This is how I'm currently doing it:

*Note: I'm fairly sure that adding a new column with withColumn("newCol") and then reading it back with select("newCol") is not the most efficient way to do this, but I don't know how to solve it better.*

val conf = new SparkConf().setAppName("SimpleApp").set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.json("students.json")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

// Collect the (age, id, name) structs of every student with the same age, ordered by id
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id"))).select("newCol")
The result I get is:

[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
Now, how can I filter out the rows that contain more than one element? That is, I want my final DataFrame to be:

[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
My best approach so far is:

val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id")))

val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age"))).filter($"count" > 1).select("newCol")
But I must be missing something, because the result is not what I expected:

[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]
You can use size() to filter the data:

import org.apache.spark.sql.functions.{col, size}

mergedDF.filter(size(col("newCol")) > 1).show(false)

+---+----+------+-----------------------------------+
|age|id  |name  |newCol                             |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
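If you also want to avoid the running, partially filled lists that the ordered window produces (and the efficiency concern raised in the question), one alternative is to aggregate with groupBy instead of a window. This is only a rough sketch, not tested against the original setup: it assumes the same df as above and a Spark version where collect_list works as a regular aggregate (Spark 2.x, or a HiveContext on 1.x), and it uses sort_array to order the structs, since collect_list alone does not guarantee ordering:

import org.apache.spark.sql.functions.{collect_list, sort_array, struct, size, col}

// One row per age instead of a running list per student;
// sort_array orders the structs by their fields (age is constant per group, so effectively by id)
val grouped = df
  .groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).as("newCol"))
  .filter(size(col("newCol")) > 1)

grouped.show(false)

Filtering on size after the groupBy keeps only the ages with more than one student, without the duplicated partial lists of the windowed version.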