Apache Spark: how to filter based on grouped values
Suppose I have the following dataframe:
val a=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num").show
+----+---+---+
|name|tag|num|
+----+---+---+
| aa| b| 1|
| aa| c| 5|
| aa| d| 0|
| xx| y| 5|
| z| zz| 9|
| z| b| 12|
+----+---+---+
I want to filter this dataframe so that, for each group of rows (grouped by name): if the group contains a row whose tag column is 'b', I take the maximum value of the num column; otherwise I ignore the whole group.
This is the result I want:
+----+---+---+
|name|tag|num|
+----+---+---+
| aa| c| 5|
| z| b| 12|
+----+---+---+
Explanation:
- The group of rows with name 'aa' contains a row where tag is 'b', so I take the maximum num of that group, which is 5
- The group of rows with name 'xx' contains no row where tag is 'b', so the whole group is ignored
- The group of rows with name 'z' contains a row where tag is 'b', so I take the maximum num of that group, which is 12
val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")
val res = spark.sql(""" with tw as (select t1.name, max(t1.num) as max_val
from tab t1
where t1.name in (select distinct t2.name
from tab t2
where t2.tag = 'b'
)
group by t1.name )
select distinct tz.name, tz.tag, tz.num
from tab tz, tw
where tz.name = tw.name
and tz.num = tw.max_val
""")
res.show(false)
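For comparison, the same logic can also be written with the DataFrame API instead of SQL. This is a sketch, not part of the original answer; `bGroups` and `maxPerGroup` are illustrative names, and it assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("aa","b",1),("aa","c",5),("aa","d",0),
             ("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")

// Keep only the names whose group contains at least one row with tag = 'b'
val bGroups = df.filter($"tag" === "b").select("name").distinct()

// Per-group maximum of num, restricted to those names
val maxPerGroup = df.join(bGroups, "name")
  .groupBy("name").agg(max("num").as("max_val"))

// Join back to recover the tag of the row holding the maximum
val res2 = df.join(maxPerGroup, Seq("name"))
  .where($"num" === $"max_val")
  .select("name", "tag", "num")
  .distinct()

res2.show(false)
```

The structure mirrors the SQL: the semi-join replaces the `IN` subquery, and the final join-plus-filter replaces the correlation on `max_val`.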
Comments:

You could create a UserDefinedAggregateFunction in Spark. UDFs are not recommended unless there is no other solution.

I am trying to use window; I think it can help with whatever you want to do... just saying, but in that case you still need an aggregation. Good luck.

Thanks for your solution. I found another one using window: `val w = Window.partitionBy("name")`

More than one road leads to Rome... but I don't think partitionBy is the obvious use case here... the answer is correct.

Yes, you are right; also, the window way is complicated.
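The window approach mentioned in the comments can be sketched as follows. This is an assumed reconstruction (the thread only shows the `Window.partitionBy("name")` line); `hasB` and `maxNum` are hypothetical helper column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("aa","b",1),("aa","c",5),("aa","d",0),
             ("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")

val w = Window.partitionBy("name")

val res = df
  // hasB: 1 if the group contains at least one row with tag = 'b'
  .withColumn("hasB", max(when(col("tag") === "b", 1).otherwise(0)).over(w))
  // maxNum: the maximum num within the group
  .withColumn("maxNum", max(col("num")).over(w))
  .filter(col("hasB") === 1 && col("num") === col("maxNum"))
  .drop("hasB", "maxNum")

res.show(false)
```

As the last comment notes, this avoids the self-join but at the cost of two window passes and extra helper columns, so the SQL version in the answer is arguably clearer.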