PySpark window partition: sortBy instead of orderBy (exclamation mark)
This is my current dataset:
+----------+--------------------+---------+--------+
|session_id|           timestamp|  item_id|category|
+----------+--------------------+---------+--------+
|         1|2014-04-07 10:51:...|214536502|       0|
|         1|2014-04-07 10:54:...|214536500|       0|
|         1|2014-04-07 10:54:...|214536506|       0|
|         1|2014-04-07 10:57:...|214577561|       0|
|         2|2014-04-07 13:56:...|214662742|       0|
|         2|2014-04-07 13:57:...|214662742|       0|
|         2|2014-04-07 13:58:...|214825110|       0|
|         2|2014-04-07 13:59:...|214757390|       0|
|         2|2014-04-07 14:00:...|214757407|       0|
|         2|2014-04-07 14:02:...|214551617|       0|
|         3|2014-04-02 13:17:...|214716935|       0|
|         3|2014-04-02 13:26:...|214774687|       0|
|         3|2014-04-02 13:30:...|214832672|       0|
|         4|2014-04-07 12:09:...|214836765|       0|
|         4|2014-04-07 12:26:...|214706482|       0|
|         6|2014-04-06 16:58:...|214701242|       0|
|         6|2014-04-06 17:02:...|214826623|       0|
|         7|2014-04-02 06:38:...|214826835|       0|
|         7|2014-04-02 06:39:...|214826715|       0|
|         8|2014-04-06 08:49:...|214838855|       0|
+----------+--------------------+---------+--------+
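I want to get the difference between the timestamp of the current row and the timestamp of the previous row.
So I converted the timestamp to seconds as follows: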
data = data.withColumn('time_seconds',data.timestamp.astype('Timestamp').cast("long"))
data.show()
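As a rough illustration of what the conversion above produces, here is a plain-Python sketch (not Spark code) that turns a timestamp string into whole seconds since the Unix epoch. The helper name `to_epoch_seconds` and the sample timestamps are hypothetical, and the sketch assumes the strings are interpreted as UTC. In Spark, casting a timestamp to long also yields epoch seconds, but the result depends on the session time zone (`spark.sql.session.timeZone`), so the absolute numbers can differ from this sketch while the differences between rows stay the same.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC (an assumption) and return epoch seconds."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# Hypothetical timestamps, three minutes apart.
first = to_epoch_seconds("2014-04-07 10:51:00")
second = to_epoch_seconds("2014-04-07 10:54:00")

# Subtracting two epoch values gives the gap in seconds.
gap_seconds = second - first
print(gap_seconds)  # 180
```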
Next, I tried the following:
from pyspark.sql import Window
from pyspark.sql import functions as F

my_window = Window.partitionBy().orderBy("session_id")
data = data.withColumn("prev_value", F.lag(data.time_seconds).over(my_window))
data = data.withColumn("diff", F.when(F.isnull(data.time_seconds - data.prev_value), 0)
                               .otherwise(data.time_seconds - data.prev_value))
data.show()
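To make the behaviour of this attempt easier to follow, here is a plain-Python model (not Spark code) of `lag` over a single, unpartitioned window. The function name `lag_diffs` is hypothetical; the `time_seconds` values are taken from the output in this post. One assumption: the model breaks ties by `time_seconds`, whereas a Spark window ordered only by `session_id` does not guarantee any order among rows that share the same `session_id`.

```python
def lag_diffs(rows):
    """Model lag() over one global window ordered by (session_id, time_seconds)."""
    ordered = sorted(rows, key=lambda r: (r["session_id"], r["time_seconds"]))
    result = []
    prev = None  # the first row has no previous value, like null in Spark
    for row in ordered:
        diff = 0 if prev is None else row["time_seconds"] - prev
        result.append({**row, "prev_value": prev, "diff": diff})
        prev = row["time_seconds"]
    return result


# Rows deliberately out of order; values copied from the output in this post.
rows = [
    {"session_id": 10000001, "time_seconds": 1410136538},
    {"session_id": 1, "time_seconds": 1396832086},
    {"session_id": 1, "time_seconds": 1396831869},
    {"session_id": 1, "time_seconds": 1396832220},
    {"session_id": 1, "time_seconds": 1396832049},
]

diffs = [r["diff"] for r in lag_diffs(rows)]
print(diffs)  # [0, 180, 37, 134, 13304318]
```

Because the window has no partition, the last difference spans two sessions. If per-session differences were wanted instead, a window such as `Window.partitionBy("session_id").orderBy("time_seconds")` would restart the lag for each session; whether that is the intent here is not stated in the post.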
This is what I got:
+----------+-----------+---------+--------+------------+----------+--------+
|session_id|  timestamp|  item_id|category|time_seconds|prev_value|    diff|
+----------+-----------+---------+--------+------------+----------+--------+
|         1|2014-04-07 |214536502|       0|  1396831869|      null|       0|
|         1|2014-04-07 |214536500|       0|  1396832049|1396831869|     180|
|         1|2014-04-07 |214536506|       0|  1396832086|1396832049|      37|
|         1|2014-04-07 |214577561|       0|  1396832220|1396832086|     134|
|  10000001|2014-09-08 |214854230|       S|  1410136538|1396832220|13304318|
|  10000001|2014-09-08 |214556216|       S|  1410136820|1410136538|     282|
|  10000001|2014-09-08 |214556212|       S|  1410136836|1410136820|      16|
|  10000001|2014-09-08 |214854230|       S|  1410136872|1410136836|      36|
|  10000001|2014-09-08 |214854125|       S|  1410137314|1410136872|     442|
|  10000002|2014-09-08 |214849322|       S|  1410167451|1410137314|   30137|
|  10000002|2014-09-08 |214838094|       S|  1410167611|1410167451|     160|
|  10000002|2014-09-08 |214714721|       S|  1410167694|1410167611|      83|
|  10000002|2014-09-08 |214853711|       S|  1410168818|1410167694|    1124|
|  10000003|2014-09-05 |214853090|       3|  1409880735|1410168818| -288083|
|  10000003|2014-09-05 |214851326|       3|  1409880865|1409880735|     130|
|  10000003|2014-09-05 |214853094|       3|  1409881043|1409880865|     178|
|  10000004|2014-09-05 |214853090|       3|  1409886885|1409881043|    5842|
|  10000004|2014-09-05 |214851326|       3|  1409889318|1409886885|    2433|
|  10000004|2014-09-05 |214853090|       3|  1409889388|1409889318|      70|
|  10000004|2014-09-05 |214851326|       3|  1409889428|1409889388|      40|
+----------+-----------+---------+--------+------------+----------+--------+
only showing top 20 rows
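I expected the session IDs to appear in numeric order, but they do not.
Is there a way to make the session IDs appear in numeric order (1, 2, 3, ...) rather than (1, 10000001, ...)?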
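One possible cause, which the post does not confirm, is that `session_id` is stored as a string, so the ordering is lexicographic rather than numeric. The plain-Python sketch below (not Spark code, with hypothetical sample IDs) shows how the two orderings differ.

```python
# Hypothetical session IDs stored as strings, as they might be after reading a CSV
# without a schema.
session_ids = ["2", "10000001", "3", "1", "10000002"]

# Lexicographic order compares character by character, so "10000001" sorts before "2".
as_strings = sorted(session_ids)

# Numeric order converts each ID to an integer before comparing.
as_numbers = sorted(session_ids, key=int)

print(as_strings)  # ['1', '10000001', '10000002', '2', '3']
print(as_numbers)  # ['1', '2', '3', '10000001', '10000002']
```

If `data.printSchema()` shows `session_id` as a string, ordering by a numeric cast such as `F.col("session_id").cast("long")` in the window's `orderBy` would be the Spark counterpart of `key=int`. This is a suggestion under that assumption, not a confirmed diagnosis.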
Thanks, everyone.
I tried changing the values passed to partitionBy and orderBy, but that did not help.
Comment: Do you want to partition by session_id or order by session_id?
Reply: orderBy session_id