
Scala Spark: Filter rows based on column values

Tags: scala, apache-spark, dataframe

I have millions of rows in a DataFrame, like below:

val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to split this data into three separate DataFrames:

  • IDs with only ACTIVE status (like id2), say activeDF
  • IDs with only INACTIVE status (like id3), say inactiveDF
  • IDs with both ACTIVE and INACTIVE statuses, say bothDF

How can I compute activeDF and inactiveDF?

    I know bothDF can be computed like this:

    df.select("id").distinct.except(activeDF).except(inactiveDF)
    
    But this involves shuffles (the distinct operation requires one). Is there a better way to compute bothDF?

    Versions:

    Spark : 2.2.1
    Scala : 2.11
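
    For context, the analogous except-based baseline for the first two (a sketch reusing the df above, not from the original post) shuffles for the same reason:

    // ids that appear with each status at least once
    val activeIds   = df.filter($"status" === "ACTIVE").select("id").distinct
    val inactiveIds = df.filter($"status" === "INACTIVE").select("id").distinct

    val activeDF   = activeIds.except(inactiveIds)   // ids that are only ACTIVE
    val inactiveDF = inactiveIds.except(activeIds)   // ids that are only INACTIVE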
    

    The most elegant solution is to pivot on status:

    val counts = df
      .groupBy("id")
      // listing the pivot values explicitly avoids an extra job
      // to discover them from the data
      .pivot("status", Seq("ACTIVE", "INACTIVE"))
      .count
    
    or the equivalent direct agg:

    val counts = df
      .groupBy("id")
      .agg(
        // when(...) is null for non-matching rows and count ignores nulls,
        // so each count tallies only rows with the given status
        count(when($"status" === "ACTIVE", true)) as "ACTIVE",
        count(when($"status" === "INACTIVE", true)) as "INACTIVE"
      )
    
    followed by a simple CASE ... WHEN:

    val result = counts.withColumn(
      "status",
      when($"ACTIVE" === 0, "INACTIVE")
        .when($"INACTIVE" === 0, "ACTIVE")
        .otherwise("BOTH")
    )
    
    result.show
    
    +---+------+--------+--------+
    | id|ACTIVE|INACTIVE|  status|
    +---+------+--------+--------+
    |id3|     0|       2|INACTIVE|
    |id1|     1|       2|    BOTH|
    |id2|     1|       0|  ACTIVE|
    +---+------+--------+--------+
    

    Later, you can separate result using filter, or dump it to disk with a source that supports partitionBy.
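
    For instance, a minimal sketch (the DataFrame names and the output path are illustrative):

    val activeDF   = result.filter($"status" === "ACTIVE").select("id")
    val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
    val bothDF     = result.filter($"status" === "BOTH").select("id")

    // or write everything out once, partitioned by the computed status
    result
      .select("id", "status")
      .write
      .partitionBy("status")
      .parquet("/tmp/ids_by_status")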

    Another approach - groupBy, collect the statuses as a set, then if the size of the set is 1 the id can only be ACTIVE or INACTIVE, otherwise BOTH:

    scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
    df: org.apache.spark.sql.DataFrame = [id: string, status: string]
    
    scala> df.show(false)
    +---+--------+
    |id |status  |
    +---+--------+
    |id1|ACTIVE  |
    |id1|INACTIVE|
    |id1|INACTIVE|
    |id2|ACTIVE  |
    |id3|INACTIVE|
    |id3|INACTIVE|
    |id4|ACTIVE  |
    |id5|ACTIVE  |
    |id6|INACTIVE|
    |id7|ACTIVE  |
    |id7|INACTIVE|
    +---+--------+
    
    
    scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
    allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
    
    scala> allstatusDF.show(false)
    +---+------------------+
    |id |allstatus         |
    +---+------------------+
    |id7|[ACTIVE, INACTIVE]|
    |id3|[INACTIVE]        |
    |id5|[ACTIVE]          |
    |id6|[INACTIVE]        |
    |id1|[ACTIVE, INACTIVE]|
    |id2|[ACTIVE]          |
    |id4|[ACTIVE]          |
    +---+------------------+
    
    
    scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
    +---+------------------+--------+
    |id |allstatus         |status  |
    +---+------------------+--------+
    |id7|[ACTIVE, INACTIVE]|BOTH    |
    |id3|[INACTIVE]        |INACTIVE|
    |id5|[ACTIVE]          |ACTIVE  |
    |id6|[INACTIVE]        |INACTIVE|
    |id1|[ACTIVE, INACTIVE]|BOTH    |
    |id2|[ACTIVE]          |ACTIVE  |
    |id4|[ACTIVE]          |ACTIVE  |
    +---+------------------+--------+
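
    From here, a minimal sketch (the labelled name is illustrative) that splits the labelled ids into the three requested DataFrames with plain filters:

    val labelled = allstatusDF.withColumn(
      "status",
      when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")
    )

    val activeDF   = labelled.filter($"status" === "ACTIVE").select("id")
    val inactiveDF = labelled.filter($"status" === "INACTIVE").select("id")
    val bothDF     = labelled.filter($"status" === "BOTH").select("id")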
    
    Thanks, it works, but performance-wise the code posted by Gadipally is better. Thanks @Gadipally - the operation executes using a single ShuffledRowRDD.
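
    To check this yourself, a quick sketch: print the physical plan and look for Exchange operators, each of which marks a shuffle boundary. A single groupBy/agg like the one above should show just one.

    // each "Exchange" node in the printed plan corresponds to a shuffle
    allstatusDF.explain()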