Update DataFrame values by timestamp in Scala
Tags: scala, apache-spark, dataframe, bigdata, spark-streaming

I have this dataframe:
+----------+-----------------------+--------+-----+------+
|customerid| event                 | A      | B   | C    |
+----------+-----------------------+--------+-----+------+
| 1222222  | 2019-02-07 06:50:40.0 | aaaaaa | 25  | 5025 |
| 1222222  | 2019-02-07 06:50:42.0 | aaaaaa | 35  | 5000 |
| 1222222  | 2019-02-07 06:51:56.0 | aaaaaa | 100 | 4965 |
+----------+-----------------------+--------+-----+------+
I want to update the value of column C based on the event (timestamp), keeping only the row with the latest value in the new dataframe, like this:
+----------+-----------------------+--------+-----+------+
|customerid| event                 | A      | B   | C    |
+----------+-----------------------+--------+-----+------+
| 1222222  | 2019-02-07 06:51:56.0 | aaaaaa | 100 | 4965 |
+----------+-----------------------+--------+-----+------+
The data arrives in streaming mode via Spark Streaming.

Answer: You can try assigning a row number partitioned by customerid and ordered by event descending, then keeping only the rows where the row number is 1. I hope this helps:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows within each customerid, newest event first,
// then keep only the top-ranked (latest) row per customer.
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
  .filter(col("rownum") === 1)
  .drop("rownum")
Comments: What have you tried so far? It looks like you need to group the rows by some common key (groupBy), then take the single record with the maximum timestamp from each group, and then map over each group's remaining records. — @AlexeyNovakov Yes, exactly: I have many keyed events, and I want to automatically get the timestamp of the last event along with the latest value of C.
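The groupBy approach suggested in the comments can be sketched with Spark's `max(struct(...))` idiom: because a struct is compared field by field, putting `event` first makes `max` select the entire latest row per key. This is a sketch using the column names from the question; `df` is assumed to be the original dataframe.

```scala
import org.apache.spark.sql.functions.{col, max, struct}

// For each customerid, take the struct with the greatest event timestamp
// (struct comparison is lexicographic, so event must come first),
// then flatten the winning struct back into top-level columns.
val latest = df
  .groupBy("customerid")
  .agg(max(struct(col("event"), col("A"), col("B"), col("C"))).as("latest"))
  .select(
    col("customerid"),
    col("latest.event").as("event"),
    col("latest.A").as("A"),
    col("latest.B").as("B"),
    col("latest.C").as("C")
  )
```

One reason to prefer this over the `row_number` answer: non-time-based window functions like `row_number` are generally not supported on streaming DataFrames in Structured Streaming, whereas a `groupBy`/`agg` can run in update or complete output mode.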