Scala: Elegantly merging rows in Spark based on multiple conditions

Hi Stack,

I'm currently trying to find an elegant way to do a specific transformation.

I have a DataFrame of actions that looks like this:

+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      103|         1|      drag|      AAA|
|      101|         1|     click|     null|
|      108|         1|     click|     null|
|      100|         2|     click|     null|
|      106|         1|      drag|      BBB|
+---------+----------+----------+---------+
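Background: users can perform two actions, click and drag. A click has no value; a drag has one. A drag implies that there was a click, but not the other way round. We also assume that a drag event can be recorded either after or before its click event. So for every drag I have a corresponding click action. What I'd like to do is merge the drag and click actions into one, i.e. delete the drag action after assigning its value to the click action.

To know which click corresponds to which drag, I have to pick the click whose timestamp is closest to the drag's timestamp. I also want to make sure that a drag cannot be linked to a click if the timestamps differ by more than 5, which means some drags may remain unlinked, and that's fine. Of course, I don't want a drag from user 1 to be matched to a click from user 2.

Here, the result would look like this: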

+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      101|         1|     click|      AAA|
|      108|         1|     click|      BBB|
|      100|         2|     click|     null|
+---------+----------+----------+---------+
The drag AAA at timestamp=103 is linked to the click that happened at 101, because that click is the closest to 103. The same logic applies to BBB.
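
To make the matching rule concrete, here is a minimal plain-Scala sketch (no Spark involved) that applies it to the sample rows above. The names `Event`, `NearestClick`, `merge` and `maxGap` are made up for illustration. It assumes that a gap of exactly 5 still counts as a match and that (user_id, timestamp) identifies a click uniquely. It compares every drag with every click, so it is only a reference for the rule, not a scalable approach.

```scala
// One row of the actions table; `value` is None for clicks.
final case class Event(timestamp: Long, userId: Int, action: String, value: Option[String])

object NearestClick {
  // Sample rows from the question, in the same order.
  val sample: Seq[Event] = Seq(
    Event(100, 1, "click", None),
    Event(101, 2, "click", None),
    Event(103, 1, "drag", Some("AAA")),
    Event(101, 1, "click", None),
    Event(108, 1, "click", None),
    Event(100, 2, "click", None),
    Event(106, 1, "drag", Some("BBB"))
  )

  // Assumption: a gap of exactly `maxGap` is still allowed ("more than 5" is rejected).
  def merge(events: Seq[Event], maxGap: Long = 5L): Seq[Event] = {
    val clicks = events.filter(_.action == "click")
    val drags  = events.filter(_.action == "drag")

    // For each drag, pick the closest click of the same user within maxGap.
    // Keyed by (userId, timestamp), so clicks must be unique on that pair;
    // if two drags pick the same click, the later drag wins.
    val assigned: Map[(Int, Long), String] = drags.flatMap { d =>
      clicks
        .filter(c => c.userId == d.userId && math.abs(c.timestamp - d.timestamp) <= maxGap)
        .sortBy(c => math.abs(c.timestamp - d.timestamp))
        .headOption
        .flatMap(c => d.value.map(v => (c.userId, c.timestamp) -> v))
    }.toMap

    // Drags are dropped; clicks keep their order and receive the matched value, if any.
    clicks.map(c => c.copy(value = assigned.get((c.userId, c.timestamp)).orElse(c.value)))
  }
}
```

Calling `NearestClick.merge(NearestClick.sample)` keeps the five clicks in their original order, with AAA on user 1's click at 101 and BBB on user 1's click at 108.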

So I'd like to perform these operations in a clean, efficient way. So far I have come up with this:

val window =  Window partitionBy ($"user_id") orderBy $"timestamp".asc

myDF
  .withColumn("previous_value", lag("value", 1, null) over window)
  .withColumn("previous_timestamp", lag("timestamp", 1, null) over window)
  .withColumn("next_value", lead("value", 1, null) over window)
  .withColumn("next_timestamp", lead("timestamp", 1, null) over window)

  .withColumn("value",
        when(
            $"previous_value".isNotNull and
            // If there is more than 5 sec. difference, it shouldn't be joined
            $"timestamp" - $"previous_timestamp" < 5 and
            (
                $"next_timestamp".isNull or
                $"next_timestamp" - $"timestamp" > $"timestamp" - $"previous_timestamp"
            ), $"previous_value")
        .otherwise(
            when($"next_timestamp" - $"timestamp" < 5, $"next_value")
            .otherwise(null)
        )
    )
  .filter($"action" === "click")
  .drop("previous_value")
  .drop("previous_timestamp")
  .drop("next_value")
  .drop("next_timestamp")
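But I feel this is rather inefficient. Is there a better way to do it, one that doesn't require creating 4 temporary columns? For example, is there a way to handle the rows at offset -1 and +1 in the same expression?

Thanks in advance

I tried this with Spark SQL instead of the DataFrame API, but it should be convertible: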

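myDF.registerTempTable("mydf")

spark.sql("""
with
clicks_table as (select * from mydf where action='click')
,drags_table as (select * from mydf where action='drag')
,one_click_many_drags as (
select
 c.timestamp as c_timestamp
,d.timestamp as d_timestamp
,c.user_id as c_user_id
,d.user_id as d_user_id
,c.action as c_action
,d.action as d_action
,c.value as c_value
,d.value as d_value
from clicks_table c
inner join drags_table d
on c.user_id = d.user_id
...

I had considered the join approach but couldn't get it quite right; using rank is exactly what I was missing! This is already much cleaner, thanks!

Glad I could help! Note that this approach will not work correctly if timestamp + user_id is not unique!

Yes, but I simplified my problem: each event actually has an ID that serves as the primary key. Thanks for the heads-up, though, and thanks for taking the time to find this nice solution!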
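
Since the query above is cut off, the following is not a transcription of it. It is only a rough sketch of how a join-then-rank idea could look in the DataFrame API, under some assumptions: the column names are those of the question's table, click rows never carry a value of their own, a gap of exactly 5 is still allowed, and (timestamp, user_id) is unique per click, as the comments point out. The names `MergeDrags`, `mergeDragsIntoClicks` and `maxGap` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{abs, col, row_number}

object MergeDrags {
  def mergeDragsIntoClicks(events: DataFrame, maxGap: Int = 5): DataFrame = {
    // Assumption: click rows have no value of their own, so the column can be dropped.
    val clicks = events.filter(col("action") === "click").drop("value")
    val drags = events
      .filter(col("action") === "drag")
      .select(col("user_id").as("d_user_id"), col("timestamp").as("d_timestamp"), col("value"))

    // Candidate pairs: same user, at most maxGap apart.
    val gap = abs(col("timestamp") - col("d_timestamp"))
    val candidates = clicks.join(drags, col("user_id") === col("d_user_id") && gap <= maxGap)

    // Rank the candidate clicks of each drag by distance and keep the closest one
    // (ties go to the earlier click).
    val byDrag = Window
      .partitionBy("d_user_id", "d_timestamp")
      .orderBy(gap.asc, col("timestamp").asc)
    val best = candidates
      .withColumn("rnk", row_number().over(byDrag))
      .filter(col("rnk") === 1)
      .select("timestamp", "user_id", "value")

    // Attach the matched value to each click; unmatched clicks keep a null value.
    // If two drags pick the same click, that click appears twice in the result.
    clicks
      .join(best, Seq("timestamp", "user_id"), "left")
      .select("timestamp", "user_id", "action", "value")
  }
}
```

With the sample data from the question, `MergeDrags.mergeDragsIntoClicks(myDF)` should return the five click rows, with AAA attached to user 1's click at 101 and BBB to user 1's click at 108. Row order is not guaranteed after the join.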