Scala Spark: filtering rows based on a column value
I have millions of rows in a DataFrame, like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to split this data into three separate DataFrames (only-ACTIVE ids, only-INACTIVE ids, and ids with both statuses), along the lines of:
df.select("id").distinct.except(activeDF).except(inactiveDF)
but that would involve shuffling (the `distinct` operation requires a shuffle as well). Is there a better way to compute these DataFrames?
Versions:
Spark : 2.2.1
Scala : 2.11
The most elegant solution is to pivot on the status column:
val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
or the equivalent direct agg:
val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )
followed by a simple CASE ... WHEN:
val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"INACTIVE" === 0, "ACTIVE")
    .otherwise("BOTH")
)
result.show
+---+------+--------+--------+
| id|ACTIVE|INACTIVE|  status|
+---+------+--------+--------+
|id3|     0|       2|INACTIVE|
|id1|     1|       2|    BOTH|
|id2|     1|       0|  ACTIVE|
+---+------+--------+--------+
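Separating the labeled result into the three DataFrames with filters might look like the following self-contained sketch (the local SparkSession and the `cache()` call are my additions, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SplitByStatus extends App {
  val spark = SparkSession.builder().master("local[*]").appName("split").getOrCreate()
  import spark.implicits._

  val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"),
               ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"))
    .toDF("id", "status")

  val counts = df
    .groupBy("id")
    .agg(
      count(when($"status" === "ACTIVE", true)) as "ACTIVE",
      count(when($"status" === "INACTIVE", true)) as "INACTIVE"
    )

  val result = counts.withColumn(
    "status",
    when($"ACTIVE" === 0, "INACTIVE")
      .when($"INACTIVE" === 0, "ACTIVE")
      .otherwise("BOTH")
  )

  // Cache so the single aggregation is not recomputed for each branch.
  result.cache()
  val activeDF   = result.filter($"status" === "ACTIVE").select("id")
  val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
  val bothDF     = result.filter($"status" === "BOTH").select("id")

  val activeCount   = activeDF.count()   // id2
  val inactiveCount = inactiveDF.count() // id3
  val bothCount     = bothDF.count()     // id1
  spark.stop()
}
```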
Later you can separate the results with filter, or dump them to disk with a source that supports partition-wise writes.

Another approach: groupBy, collect the statuses as a set, and then, if the set's size is 1, the id can only be ACTIVE or INACTIVE; otherwise it is BOTH.
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+
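The "dump to disk with a partition-aware source" option from the first answer could be sketched like this (the output path and local SparkSession are hypothetical, not from the original answers):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DumpByStatus extends App {
  val spark = SparkSession.builder().master("local[*]").appName("dump").getOrCreate()
  import spark.implicits._

  val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"),
               ("id2", "ACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")

  // Same collect_set / size labeling as above.
  val labeled = df
    .groupBy("id")
    .agg(collect_set("status") as "allstatus")
    .withColumn("status",
      when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH"))

  // partitionBy writes one sub-directory per distinct status value,
  // so each group can later be read back independently.
  labeled.select("id", "status")
    .write
    .mode("overwrite")
    .partitionBy("status")
    .parquet("/tmp/ids_by_status") // hypothetical output path

  val rows = spark.read.parquet("/tmp/ids_by_status").count() // one row per id
  spark.stop()
}
```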
Thanks, it works, but the code posted by Gadipally performs better. Thanks @Gadipally. The operation executes with a single ShuffledRowRDD.