
Python: Flatten a PySpark dataframe to get a timestamp for each specific value and field


I am trying to find the value changes for each column attribute as follows:

from pyspark.sql import Window
import pyspark.sql.functions as F

windowSpec = Window.partitionBy("attribute").orderBy(df_series["time"].asc())

# lag() with a negative offset looks one row ahead (equivalent to lead())
final_df_series = df_series.withColumn("lagdate", F.lag(df_series["time"], -1).over(windowSpec))\
    .withColumn("value_lagvalue$df", F.lag(df_series["value"], -1).over(windowSpec))\
    .withColumn("value_grp$df", (F.col("value") - F.col("value_lagvalue$df")).cast("int"))\
    .filter(F.col("value_grp$df") != 0).drop("value_grp$df")\
    .select("attribute", "lagdate", "value_lagvalue$df").persist()

The dataframe output of the above code is:

+---------+-------------------+-----------------+
|attribute|            lagdate|value_lagvalue$df|
+---------+-------------------+-----------------+
| column93|2020-09-07 10:29:24|                3|
| column93|2020-09-07 10:29:38|                1|
| column93|2020-09-07 10:31:08|                0|

| column94|2020-09-07 10:29:26|                3|
| column94|2020-09-07 10:29:40|                1|
| column94|2020-09-07 10:31:18|                0|

|column281|2020-09-07 10:29:34|                3|
|column281|2020-09-07 10:29:54|                0|
|column281|2020-09-07 10:31:08|                3|
|column281|2020-09-07 10:31:13|                0|
|column281|2020-09-07 10:35:24|                3|
|column281|2020-09-07 10:36:08|                0|

|column282|2020-09-07 10:41:13|                3|
|column282|2020-09-07 10:49:24|                1|

|column284|2020-09-07 10:51:08|                1|
|column284|2020-09-07 11:01:13|                0|

|column285|2020-09-07 11:21:13|                1|
+---------+-------------------+-----------------+
I want to convert it into the following structure:

attribute,timestamp_3,timestamp_1,timestamp_0
column93,2020-09-07 10:29:24,2020-09-07 10:29:38,2020-09-07 10:31:08
column94,2020-09-07 10:29:26,2020-09-07 10:29:40,2020-09-07 10:31:18
column281,2020-09-07 10:29:34,null,2020-09-07 10:29:54
column281,2020-09-07 10:31:08,null,2020-09-07 10:31:13
column281,2020-09-07 10:35:24,null,2020-09-07 10:36:08
column282,2020-09-07 10:41:13,2020-09-07 10:49:24,null
column284,null,2020-09-07 10:51:08,2020-09-07 11:01:13
column285,null,2020-09-07 11:21:13,null
Any help is appreciated. (A solution in pyspark is preferable, since it is inherently optimized for large dataframes like this, but a pandas one would also be very useful.)
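
One way to get there, sketched below as a minimal example rather than a definitive solution: assuming each cycle of values for an attribute only ever decreases (3 → 1 → 0, 3 → 0, or 1 → 0, as in the sample output), a new cycle starts wherever the value fails to decrease; number the cycles with a running sum, then pivot each cycle's values into columns. The prev_value, new_cycle and cycle columns are hypothetical helper names, not part of the original code.

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("attribute").orderBy("lagdate")

# Flag the start of a new cycle: the first row of each attribute, or any
# row whose value did not decrease relative to the previous row.
cycled = (final_df_series
    .withColumn("prev_value", F.lag("value_lagvalue$df").over(w))
    .withColumn("new_cycle",
                F.when(F.col("prev_value").isNull()
                       | (F.col("value_lagvalue$df") >= F.col("prev_value")), 1)
                 .otherwise(0))
    # A running sum of the flags gives a cycle id per attribute.
    .withColumn("cycle",
                F.sum("new_cycle").over(
                    w.rowsBetween(Window.unboundedPreceding, Window.currentRow))))

# One row per (attribute, cycle); pivot the value into timestamp_* columns.
result = (cycled
    .groupBy("attribute", "cycle")
    .pivot("value_lagvalue$df", [3, 1, 0])
    .agg(F.first("lagdate"))
    .withColumnRenamed("3", "timestamp_3")
    .withColumnRenamed("1", "timestamp_1")
    .withColumnRenamed("0", "timestamp_0")
    .orderBy("attribute", "cycle")
    .drop("cycle"))

On the sample data this would yield three rows for column281 (one per 3 → 0 cycle) with timestamp_1 left null, matching the desired structure.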

Update:

This post seems to achieve almost the same goal. Hoping the community can help reach the intended result.


Comments:

You can group by and use array_list(), but not sure whether this helps; the timestamp column would end up in a list.

@dsk The timestamp sequence for each attribute needs to be maintained independently.

@dsk Updated with some helpful links.
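
For reference, a minimal sketch of the grouping idea from the first comment; array_list() does not exist in PySpark, so this assumes collect_list() was meant. It gathers all timestamps per attribute and value into an array, which, as the reply notes, loses the independent per-cycle sequence the question needs:

# Assumes collect_list() is what the comment meant by array_list().
# All timestamps for one (attribute, value) pair land in a single sorted
# array, so the one-row-per-cycle layout of the desired output is lost.
grouped = (final_df_series
    .groupBy("attribute", "value_lagvalue$df")
    .agg(F.sort_array(F.collect_list("lagdate")).alias("timestamps")))
grouped.show(truncate=False)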