countDistinct based on a condition in Spark Scala
I have the following dataframe:
+-------+---+----+
|Company|EMP|Flag|
+-------+---+----+
| M| c1| Y|
| M| c1| Y|
| M| c2| N|
| M| c2| N|
| M| c3| Y|
| M| c3| Y|
| M| c4| N|
| M| c4| N|
| M| c5| Y|
| M| c5| Y|
| M| c6| Y|
+-------+---+----+
created with:
import spark.implicits._

val df1 = Seq(
  ("M", "c1", "Y"),
  ("M", "c1", "Y"),
  ("M", "c2", "N"),
  ("M", "c2", "N"),
  ("M", "c3", "Y"),
  ("M", "c3", "Y"),
  ("M", "c4", "N"),
  ("M", "c4", "N"),
  ("M", "c5", "Y"),
  ("M", "c5", "Y"),
  ("M", "c6", "Y")
).toDF("Company", "EMP", "Flag")
How can I compute the distinct EMP count for FLAG = Y and for FLAG = N? Once an EMP has a FLAG, it never changes. I can achieve this with distinct, but is there a way to do it without using distinct (to avoid the extra join in the code)?
Expected output:
+-------+---+---+----------+----------+
|Company|  Y|  N|Total_ROWs|Unique_Emp|
+-------+---+---+----------+----------+
|      M|  4|  2|        11|         6|
+-------+---+---+----------+----------+
How about this?
df1.groupBy("Company", "EMP", "Flag")
  .agg(count("Company").as("Row"))            // row count per distinct (Company, EMP, Flag)
  .groupBy("Company", "EMP", "Flag")          // same keys as above, so this stage is a pass-through
  .agg(count("Flag").as("YN"), sum("Row").as("Row"))
  .groupBy("Company")
  .agg(
    count(when($"Flag" === "Y", 1)).as("Y"),  // distinct EMPs flagged Y
    count(when($"Flag" === "N", 1)).as("N"),  // distinct EMPs flagged N
    sum("Row").as("Total_ROWs"),              // total input rows
    count("EMP").as("Unique_EMP"))            // distinct EMPs overall
  .show
+-------+---+---+----------+----------+
|Company| Y| N|Total_ROWs|Unique_EMP|
+-------+---+---+----------+----------+
| M| 4| 2| 11| 6|
+-------+---+---+----------+----------+
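If the goal is simply to avoid a DataFrame-level distinct plus join, another option (a sketch, assuming the df1 defined above and spark.implicits._ in scope) is a single aggregation with a conditional countDistinct. This works here because each EMP keeps exactly one Flag:

```scala
import org.apache.spark.sql.functions._

df1.groupBy("Company")
  .agg(
    countDistinct(when($"Flag" === "Y", $"EMP")).as("Y"),  // distinct EMPs with Flag = Y
    countDistinct(when($"Flag" === "N", $"EMP")).as("N"),  // distinct EMPs with Flag = N
    count(lit(1)).as("Total_ROWs"),                        // all input rows
    countDistinct($"EMP").as("Unique_EMP"))                // distinct EMPs overall
  .show
```

`when` without an `otherwise` returns null for non-matching rows, and `countDistinct` ignores nulls, so each branch counts only the EMPs that match its condition. On the data above this yields M, 4, 2, 11, 6.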
Comments:
- Could you do me a favor? Unbounded following, is this required? Possibly; I have tried using max or last, but with the default window frame it only finds values up to the current row.
- Sorry!! I think I did not post my question correctly. I have edited it now.
- Hi @Lamanus, sorry for the late reply, I ran into another urgent issue and could not check this. This is very helpful. Thanks :)