
Apache Spark: running sum between two timestamps in PySpark


I have data in the following format:

+---------------------+----+----+---------+----------+
|      date_time      | id | cm | p_count |   bcm    |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 |       1 |  null    |
| 2018-02-01 05:37:07 | v1 | c1 |       1 |  null    |
| 2018-02-01 11:19:38 | v1 | c1 |       1 |  null    |
| 2018-02-01 12:09:19 | v1 | c1 |       1 |  c1      |
| 2018-02-01 14:05:10 | v2 | c2 |       1 |  c2      |
+---------------------+----+----+---------+----------+
I need to find the rolling sum of the p_count column between two date_times, partitioned by id.

The logic for the start and end date_time of the rolling-sum window is as follows:

start_date_time=min(date_time) group by (id,cm)

end_date_time= bcm == cm ? date_time : null
In this case, start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.

The output should be as follows:

+---------------------+----+----+---------+----------+-------------+
|      date_time      | id | cm | p_count |   bcm    | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 |       1 |  null    |1            |
| 2018-02-01 05:37:07 | v1 | c1 |       1 |  null    |2            |
| 2018-02-01 11:19:38 | v1 | c1 |       1 |  null    |3            |
| 2018-02-01 12:09:19 | v1 | c1 |       1 |  c1      |4            |
| 2018-02-01 14:05:10 | v2 | c2 |       1 |  c2      |1            |
+---------------------+----+----+---------+----------+-------------+
Result:

+---------------------+----+----+---------+----------+-------------+
|      date_time      | id | cm | p_count |   bcm    | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 |       1 |  null    |1            |
| 2018-02-01 05:37:07 | v1 | c1 |       1 |  null    |2            |
| 2018-02-01 11:19:38 | v1 | c1 |       1 |  null    |3            |
| 2018-02-01 12:09:19 | v1 | c1 |       1 |  c1      |4            |
| 2018-02-01 14:05:10 | v2 | c2 |       1 |  c2      |1            |
+---------------------+----+----+---------+----------+-------------+
        input.createOrReplaceTempView("input_Table");
        import org.apache.spark.sql.expressions.Window
        import org.apache.spark.sql.functions._

        //val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")

        val results = sqlContext.sql("select *, " +
          "SUM(p_count) over ( order by id  rows between unbounded preceding and current row ) cumulative_Sum " +
          "from input_Table ").show
+-------------------+---+---+-------+----+--------------+
|          date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1|      1|null|             1|
|2018-02-01 05:37:07| v1| c1|      1|null|             2|
|2018-02-01 11:19:38| v1| c1|      1|null|             3|
|2018-02-01 12:09:19| v1| c1|      1|  c1|             4|
|2018-02-01 14:05:10| v2| c2|      1|  c2|             5|
+-------------------+---+---+-------+----+--------------+
You need to group (partition) by when opening the window, and add logic to get the expected results.
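To make that concrete, here is a minimal PySpark sketch (the question is tagged pyspark) of the same query with a PARTITION BY added, assuming a SparkSession named spark and the input_Table view registered as in the snippet above; the start/end date_time bounds from the question are still not handled here:

    partitioned = spark.sql("""
        SELECT *,
               SUM(p_count) OVER (
                   PARTITION BY id
                   ORDER BY date_time
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS p_sum_count
        FROM input_Table
    """)
    partitioned.show(truncate=False)

With PARTITION BY id and ORDER BY date_time the sum restarts at 1 for v2, which already matches the p_sum_count column of the expected output for this sample.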

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Logically, a windowed aggregate function is computed anew for each row within the partition, based on all rows between a starting row and an ending row.

The starting and ending rows may be fixed or relative to the current row, based on the following keywords:

  • UNBOUNDED PRECEDING, all rows before the current row -> fixed
  • UNBOUNDED FOLLOWING, all rows after the current row -> fixed
  • x PRECEDING, x rows before the current row -> relative
  • y FOLLOWING, y rows after the current row -> relative
Possible kinds of calculation include:

Both the starting and ending rows are fixed; the window consists of all rows of a partition, e.g. a group sum, i.e. aggregate plus detail rows.

One end is fixed, the other is relative to the current row; the number of rows increases or decreases, e.g. a running total or remaining sum.

Both the starting and ending rows are relative to the current row; the number of rows within the window is fixed, e.g. a moving average over n rows.
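As a small illustration of the relative bounds above, here is a hedged PySpark sketch of a moving average over 3 rows; the toy data and the column names col and x are made up for the example:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()
    demo = spark.createDataFrame([(1, 11), (2, 2), (3, 3), (4, 44)], ["col", "x"])

    # Relative frame: the 2 preceding rows plus the current row -> moving average over 3 rows.
    moving = Window.orderBy("col").rowsBetween(-2, Window.currentRow)
    demo.withColumn("moving_avg_3", F.avg("x").over(moving)).show()
    # moving_avg_3: 11.0, 6.5, 5.33..., 16.33...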

So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a cumulative sum, or running total:

11 -> 11
 2 -> 11 +  2                = 13
 3 -> 13 +  3 (or 11+2+3)    = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
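The same arithmetic can be reproduced with Spark SQL; a minimal sketch, using the same made-up values and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 11), (2, 2), (3, 3), (4, 44)], ["col", "x"])
    df.createOrReplaceTempView("t")

    # SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) -> 11, 13, 16, 60
    spark.sql("""
        SELECT col, x,
               SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) AS running_total
        FROM t
    """).show()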

Hey Vaquar, thanks for the reply and the explanation, but this is not what I am looking for. I need to apply the date-range condition and partition by visitor id; in your solution, only a simple rolling sum is done.
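For what it's worth, a hedged PySpark sketch of one possible reading of the logic described in the question (start_dt = min(date_time) per (id, cm), end_dt = the date_time of the row where bcm == cm, running sum of p_count partitioned by id); the variable names start_dt, end_dt, bounds and running are made up for the example:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("2018-02-01 04:38:00", "v1", "c1", 1, None),
         ("2018-02-01 05:37:07", "v1", "c1", 1, None),
         ("2018-02-01 11:19:38", "v1", "c1", 1, None),
         ("2018-02-01 12:09:19", "v1", "c1", 1, "c1"),
         ("2018-02-01 14:05:10", "v2", "c2", 1, "c2")],
        ["date_time", "id", "cm", "p_count", "bcm"],
    ).withColumn("date_time", F.to_timestamp("date_time"))

    # Per (id, cm): the window starts at the earliest date_time and ends at the
    # date_time of the row where bcm == cm (null if no such row exists).
    bounds = Window.partitionBy("id", "cm")
    df = (df
          .withColumn("start_dt", F.min("date_time").over(bounds))
          .withColumn("end_dt",
                      F.max(F.when(F.col("bcm") == F.col("cm"), F.col("date_time")))
                       .over(bounds)))

    # Running sum of p_count per id, ordered by date_time, counting only rows
    # that fall inside [start_dt, end_dt].
    running = (Window.partitionBy("id")
               .orderBy("date_time")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    in_range = F.col("date_time").between(F.col("start_dt"), F.col("end_dt"))

    result = df.withColumn("p_sum_count",
                           F.sum(F.when(in_range, F.col("p_count")).otherwise(0)).over(running))
    result.orderBy("id", "date_time").show(truncate=False)

On the sample data this reproduces the p_sum_count column from the expected output; rows falling outside [start_dt, end_dt] (none in the sample) would simply not increase the sum.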