Python: flatten a pyspark dataframe to get the timestamp of each specific value and field
I am trying to find the value changes for each column attribute with the following:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("attribute").orderBy(df_series["time"].asc())
final_df_series = (
    df_series.withColumn("lagdate", F.lag(df_series["time"], -1).over(windowSpec))
             .withColumn("value_lagvalue$df", F.lag(df_series["value"], -1).over(windowSpec))
             .withColumn("value_grp$df",
                         (F.col("value") - F.col("value_lagvalue$df")).cast("int"))
             .filter(F.col("value_grp$df") != 0)
             .drop("value_grp$df")
             .select("attribute", "lagdate", "value_lagvalue$df")
             .persist()
)
The dataframe output of the code above is:
+---------+-------------------+-----------------+
|attribute| lagdate|value_lagvalue$df|
+---------+-------------------+-----------------+
| column93|2020-09-07 10:29:24| 3|
| column93|2020-09-07 10:29:38| 1|
| column93|2020-09-07 10:31:08| 0|
| column94|2020-09-07 10:29:26| 3|
| column94|2020-09-07 10:29:40| 1|
| column94|2020-09-07 10:31:18| 0|
|column281|2020-09-07 10:29:34| 3|
|column281|2020-09-07 10:29:54| 0|
|column281|2020-09-07 10:31:08| 3|
|column281|2020-09-07 10:31:13| 0|
|column281|2020-09-07 10:35:24| 3|
|column281|2020-09-07 10:36:08| 0|
|column282|2020-09-07 10:41:13| 3|
|column282|2020-09-07 10:49:24| 1|
|column284|2020-09-07 10:51:08| 1|
|column284|2020-09-07 11:01:13| 0|
|column285|2020-09-07 11:21:13| 1|
+---------+-------------------+-----------------+
I want to transform it into the following structure:
attribute,timestamp_3,timestamp_1,timestamp_0
column93,2020-09-07 10:29:24,2020-09-07 10:29:38,2020-09-07 10:31:08
column94,2020-09-07 10:29:26,2020-09-07 10:29:40,2020-09-07 10:31:18
column281,2020-09-07 10:29:34,null,2020-09-07 10:29:54
column281,2020-09-07 10:31:08,null,2020-09-07 10:31:13
column281,2020-09-07 10:35:24,null,2020-09-07 10:36:08
column282,2020-09-07 10:41:13,2020-09-07 10:49:24,null
column284,null,2020-09-07 10:51:08,2020-09-07 11:01:13
column285,null,2020-09-07 11:21:13,null
Any help is appreciated. (A solution in pyspark is preferable, since it is optimized for large dataframes like this one by nature, but a pandas solution would also be very useful.)
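Since a pandas answer is also welcome, here is a minimal sketch of the same run-and-pivot idea in pandas; the `run_id` helper column is an assumption of mine, not part of the original post.

```python
import pandas as pd

# A few rows of the intermediate dataframe shown above.
df = pd.DataFrame({
    "attribute": ["column93", "column93", "column93",
                  "column281", "column281", "column281", "column281"],
    "lagdate": ["2020-09-07 10:29:24", "2020-09-07 10:29:38", "2020-09-07 10:31:08",
                "2020-09-07 10:29:34", "2020-09-07 10:29:54",
                "2020-09-07 10:31:08", "2020-09-07 10:31:13"],
    "value": [3, 1, 0, 3, 0, 3, 0],
})

df = df.sort_values(["attribute", "lagdate"])

# A new run starts when the value does not decrease within an attribute;
# a per-attribute cumulative sum of those flags numbers the runs.
prev = df.groupby("attribute")["value"].shift()
new_run = (prev.isna() | (df["value"] >= prev)).astype(int)
df["run_id"] = new_run.groupby(df["attribute"]).cumsum()

# Pivot each run into one row with one timestamp column per value.
result = (
    df.pivot_table(index=["attribute", "run_id"], columns="value",
                   values="lagdate", aggfunc="first")
      .rename(columns={3: "timestamp_3", 1: "timestamp_1", 0: "timestamp_0"})
      .reindex(columns=["timestamp_3", "timestamp_1", "timestamp_0"])
      .reset_index()
)
print(result)
```

Missing values in a run (e.g. a 3 followed directly by a 0) simply come out as NaN in the pivoted columns, matching the nulls in the desired output.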
Update:
This post seems to achieve almost the same goal. Hoping the community can help reach the desired result.
Comments:
You can group by and use array_list(), but not sure whether that helps; the timestamp column would end up inside the list.
@dsk The timestamp sequence of each attribute needs to be maintained independently.
@dsk Updated with some helpful links.