
Update DataFrame values by timestamp in Scala

I have this DataFrame:

+------------+-----------------------+--------+------+------+
| customerid | event                 | A      | B    | C    |
+------------+-----------------------+--------+------+------+
|    1222222 | 2019-02-07 06:50:40.0 | aaaaaa |   25 | 5025 |
|    1222222 | 2019-02-07 06:50:42.0 | aaaaaa |   35 | 5000 |
|    1222222 | 2019-02-07 06:51:56.0 | aaaaaa |  100 | 4965 |
+------------+-----------------------+--------+------+------+
I want to update the value of column C by event (timestamp) and keep, in a new DataFrame, only the row carrying the latest value, like this:

+------------+-----------------------+--------+------+------+
| customerid | event                 | A      | B    | C    |
+------------+-----------------------+--------+------+------+
|    1222222 | 2019-02-07 06:51:56.0 | aaaaaa |  100 | 4965 |
+------------+-----------------------+--------+------+------+

The data arrives in streaming mode, via Spark Streaming.

You can try creating a row number over a window partitioned by customerid and ordered by event descending, then keep only the rows where the row number is 1. I hope this helps.

df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
    .filter(col("rownum") === 1)
    .drop("rownum")

What have you done so far? It looks like you need to group the rows by some common key (groupBy) and then, from each group, take the single record with the maximum timestamp, mapping over what remains of each group. @AlexeyNovakov Yes, exactly: I have many events per key, and I want to automatically get the timestamp of the last event and the last value of C (see the sketch below).
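
A minimal sketch of that groupBy alternative, assuming the same df as above (the latest column name is illustrative): packing each row into a struct with event as the first field lets max pick the row with the greatest event, since structs compare field by field.

import org.apache.spark.sql.functions.{col, max, struct}

// Structs compare field by field, so with event first, max() returns the
// struct built from the latest event within each customerid group.
val latest = df
  .groupBy("customerid")
  .agg(max(struct(col("event"), col("A"), col("B"), col("C"))).as("latest"))
  .select("customerid", "latest.*")

Compared with the window version, this avoids the helper rownum column and the extra filter pass.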