Apache Spark: how to filter rows based on grouped values in Spark

Tags: apache-spark, apache-spark-sql

Suppose I have the following DataFrame:

val a = Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
a.show
+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  b|  1|
|  aa|  c|  5|
|  aa|  d|  0|
|  xx|  y|  5|
|   z| zz|  9|
|   z|  b| 12|
+----+---+---+
I want to filter this DataFrame so that:

For each group of rows (grouped by name): if the column tag takes the value 'b' anywhere in the group, I keep the row with the maximum value of num for that group; otherwise I drop the whole group.

This is the result I want:

+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  c|  5|
|   z|  b| 12|
+----+---+---+
Explanation

  • The group of rows with name 'aa' contains a row where tag is 'b', so I take the maximum of num for this group, which is 5
  • The group of rows with name 'xx' contains no row where tag is 'b', so the whole group is dropped
  • The group of rows with name 'z' contains a row where tag is 'b', so I take the maximum of num for this group, which is 12
Try this (the CTE tw computes the per-name maximum of num, restricted to names that have at least one row with tag = 'b'; the outer query then joins back to tab to recover the full rows):

val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")

val res = spark.sql(""" with tw as (select t1.name, max(t1.num) as max_val
                          from tab t1 
                         where t1.name in (select distinct t2.name 
                                             from tab t2
                                            where t2.tag = 'b'
                                          )
                      group by t1.name )
                      select distinct tz.name, tz.tag, tz.num
                        from tab tz, tw
                       where tz.name = tw.name
                         and tz.num  = tw.max_val
                   """) 
res.show(false)
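
For reference, the same logic can also be written with the DataFrame API instead of SQL. A minimal sketch, assuming the df defined above (namesWithB and maxPerName are illustrative names, not part of the original answer):

import org.apache.spark.sql.functions.{col, max}

// names of groups that contain at least one row with tag == 'b'
val namesWithB = df.filter(col("tag") === "b").select("name").distinct()

// per-name maximum of num, restricted to those groups
val maxPerName = df.join(namesWithB, "name")
  .groupBy("name")
  .agg(max("num").as("num"))

// join back on (name, num) to recover the full matching rows
val res2 = df.join(maxPerName, Seq("name", "num")).select("name", "tag", "num")
res2.show(false)

This mirrors the SQL: the first join plays the role of the IN subquery, and the final join replaces the WHERE-based matching against max_val.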

Comments:

  • You could create a UserDefinedAggregateFunction in Spark. UDFs are not recommended unless there is no other solution.
  • I'm trying it with window; I think it can help with whatever you want to do... just saying, but in this case you would still need an aggregation. Good luck!
  • Thanks for your solution. I found another one with window: `val w = Window.partitionBy("name")`
  • More than one road leads to Rome... but I don't think partitionBy is the obvious use case here... The answer is correct.
  • Yes, you are right; also, the window approach is convoluted.
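
For completeness, a minimal sketch of the window-based alternative mentioned in the comments, assuming the same df as above (the has_b and max_num column names are only for illustration):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// every row sees the aggregates of its name group
val w = Window.partitionBy("name")

val resWindow = df
  // 1 if the group contains at least one row with tag == 'b', else 0
  .withColumn("has_b", max(when(col("tag") === "b", 1).otherwise(0)).over(w))
  // per-group maximum of num
  .withColumn("max_num", max(col("num")).over(w))
  // keep only rows at the group maximum, in groups that have a 'b' tag
  .filter(col("has_b") === 1 && col("num") === col("max_num"))
  .drop("has_b", "max_num")

resWindow.show(false)

This avoids the self-join but, as noted in the comments, it is arguably the more convoluted of the two approaches.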