Apache Spark: find blocks of null values in a series of values


Let's say this is my data:

date         value
2016-01-01   1
2016-01-02   NULL
2016-01-03   NULL
2016-01-04   2
2016-01-05   3
2016-01-06   NULL
2016-01-07   NULL
2016-01-08   NULL
2016-01-09   1
I am trying to find the start and end dates that surround each group of null values. Example output would look like this:

start        end
2016-01-01   2016-01-04
2016-01-05   2016-01-09
My first attempt at the problem produced the following:

df.filter($"value".isNull)
  .agg(to_date(date_add(max("date"), 1)) as "max",
       to_date(date_sub(min("date"), 1)) as "min")

But this only finds the overall min and max. I thought about using groupBy, but I don't know how to create a column for each block of nulls.

I don't have a working solution, but I do have a few recommendations.

Have a look at building a lag_value column with lag over the date ordering; you will also have to change that code slightly to produce a lead column (see the sketch below).
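Since the original lag snippet is not shown here, below is a minimal sketch, assuming a window ordered by date, of one common way to build the lag_value and lead_value columns. The exact window spec the answer used is an assumption, so the resulting values may not line up exactly with the table that follows.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2016-01-01", 1), ("2016-01-02", None), ("2016-01-03", None),
     ("2016-01-04", 2), ("2016-01-05", 3), ("2016-01-06", None),
     ("2016-01-07", None), ("2016-01-08", None), ("2016-01-09", 1)],
    ["date", "value"],
)

w = Window.orderBy("date")  # one window over the whole series; fine for a small example

df_2 = (
    df.withColumn("lag_value", F.lag("value", 1).over(w))    # previous row's value
      .withColumn("lead_value", F.lead("value", 1).over(w))  # next row's value
)
df_2.show()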

Now assume you have your lag and lead columns. Your resulting dataframe will look like this:

date         value     lag_value     lead_value
2016-01-01   1         NULL          1 
2016-01-02   NULL      NULL          1
2016-01-03   NULL      2             NULL
2016-01-04   2         3             NULL
2016-01-05   3         NULL          2
2016-01-06   NULL      NULL          3
2016-01-07   NULL      NULL          NULL
2016-01-08   NULL      1             NULL
2016-01-09   1         1             NULL
Now all you want to do is filter based on the following conditions:

min date:
df.filter("value IS NOT NULL AND lag_value IS NULL")

max date:
df.filter("value IS NULL AND lead_value IS NOT NULL")
If you want to be a bit more advanced, you can also use the when function to create a new column that states whether the date is the start or end date of a null group:

date         value     lag_value     lead_value   group_date_type
2016-01-01   1         NULL          1            start
2016-01-02   NULL      NULL          1            NULL
2016-01-03   NULL      2             NULL         NULL   
2016-01-04   2         3             NULL         end
2016-01-05   3         NULL          2            start
2016-01-06   NULL      NULL          3            NULL
2016-01-07   NULL      NULL          NULL         NULL
2016-01-08   NULL      1             NULL         NULL
2016-01-09   1         1             NULL         end 
This can be created with something like this:

from pyspark.sql import functions as F
df_2.withColumn('group_date_type',
                F.when(F.expr("value IS NOT NULL AND lag_value IS NULL"), F.lit("start"))
                  .when(F.expr("value IS NULL AND lead_value IS NOT NULL"), F.lit("end"))
                  .otherwise(None)
                 )
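As a follow-up not in the original answer: once a group_date_type column like the one in the table above exists, the start and end labels could be paired into the requested ranges roughly as sketched below. The hand-typed labelled rows simply mirror that table, and the names labelled and ranges are illustrative.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# stand-in for df_2.where("group_date_type IS NOT NULL").select("date", "group_date_type")
labelled = spark.createDataFrame(
    [("2016-01-01", "start"), ("2016-01-04", "end"),
     ("2016-01-05", "start"), ("2016-01-09", "end")],
    ["date", "group_date_type"],
)

ranges = (
    labelled
    .withColumn("end", F.lead("date", 1).over(Window.orderBy("date")))  # date of the next label
    .where("group_date_type = 'start'")                                 # one row per null block
    .select(F.col("date").alias("start"), "end")
)
ranges.show()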

The tricky part is getting the boundaries of the groups, so several steps are needed:

  • First build the null/non-null blocks (using window functions)
  • Then group by block to get the boundaries within each block
  • Then use window functions again to extend the borders
Here is a working example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

import ss.implicits._  // ss is the SparkSession

val df = Seq(
  ("2016-01-01", Some(1)),
  ("2016-01-02", None),
  ("2016-01-03", None),
  ("2016-01-04", Some(2)),
  ("2016-01-05", Some(3)),
  ("2016-01-06", None),
  ("2016-01-07", None),
  ("2016-01-08", None),
  ("2016-01-09", Some(1))
).toDF("date", "value")


df
  // build blocks
  .withColumn("isnull", when($"value".isNull, true).otherwise(false))
  .withColumn("lag_isnull", lag($"isnull",1).over(Window.orderBy($"date")))
  .withColumn("change", coalesce($"isnull"=!=$"lag_isnull",lit(false)))
  .withColumn("block", sum($"change".cast("int")).over(Window.orderBy($"date")))
  // now calculate min/max within groups
  .groupBy($"block")
  .agg(
    min($"date").as("tmp_min"),
    max($"date").as("tmp_max"),
    (count($"value")===0).as("null_block")
  )
  // now extend groups to include borders
  .withColumn("min", lag($"tmp_max", 1).over(Window.orderBy($"tmp_min")))
  .withColumn("max", lead($"tmp_min", 1).over(Window.orderBy($"tmp_max")))
  // only select null-groups
  .where($"null_block")
  .select($"min", $"max")
  .orderBy($"min")
  .show()
giving

+----------+----------+
|       min|       max|
+----------+----------+
|2016-01-01|2016-01-04|
|2016-01-05|2016-01-09|
+----------+----------+
Not sure why this wasn't a valid solution; it works for me, thanks.
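For the PySpark side of the question's tags, here is a sketch of the same block-building idea translated to Python; it is an approximation of the Scala answer above rather than code from the thread.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2016-01-01", 1), ("2016-01-02", None), ("2016-01-03", None),
     ("2016-01-04", 2), ("2016-01-05", 3), ("2016-01-06", None),
     ("2016-01-07", None), ("2016-01-08", None), ("2016-01-09", 1)],
    ["date", "value"],
)

w = Window.orderBy("date")

# build blocks: a new block starts whenever value flips between null and non-null
blocks = (
    df.withColumn("isnull", F.col("value").isNull())
      .withColumn("change",
                  F.coalesce(F.col("isnull") != F.lag("isnull", 1).over(w), F.lit(False)))
      .withColumn("block", F.sum(F.col("change").cast("int")).over(w))
      # min/max date within each block, and whether the block is all nulls
      .groupBy("block")
      .agg(F.min("date").alias("tmp_min"),
           F.max("date").alias("tmp_max"),
           (F.count("value") == 0).alias("null_block"))
)

# extend each null block to the neighbouring non-null dates and keep only the null blocks
result = (
    blocks.withColumn("start", F.lag("tmp_max", 1).over(Window.orderBy("tmp_min")))
          .withColumn("end", F.lead("tmp_min", 1).over(Window.orderBy("tmp_max")))
          .where("null_block")
          .select("start", "end")
          .orderBy("start")
)
result.show()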