PySpark Window partition: sort by instead of order by (exclamation mark)

This is my current dataset:

+----------+--------------------+---------+--------+
|session_id|           timestamp|  item_id|category|
+----------+--------------------+---------+--------+
|         1|2014-04-07 10:51:...|214536502|       0|
|         1|2014-04-07 10:54:...|214536500|       0|
|         1|2014-04-07 10:54:...|214536506|       0|
|         1|2014-04-07 10:57:...|214577561|       0|
|         2|2014-04-07 13:56:...|214662742|       0|
|         2|2014-04-07 13:57:...|214662742|       0|
|         2|2014-04-07 13:58:...|214825110|       0|
|         2|2014-04-07 13:59:...|214757390|       0|
|         2|2014-04-07 14:00:...|214757407|       0|
|         2|2014-04-07 14:02:...|214551617|       0|
|         3|2014-04-02 13:17:...|214716935|       0|
|         3|2014-04-02 13:26:...|214774687|       0|
|         3|2014-04-02 13:30:...|214832672|       0|
|         4|2014-04-07 12:09:...|214836765|       0|
|         4|2014-04-07 12:26:...|214706482|       0|
|         6|2014-04-06 16:58:...|214701242|       0|
|         6|2014-04-06 17:02:...|214826623|       0|
|         7|2014-04-02 06:38:...|214826835|       0|
|         7|2014-04-02 06:39:...|214826715|       0|
|         8|2014-04-06 08:49:...|214838855|       0|
+----------+--------------------+---------+--------+
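
I want to get the difference between the current row's timestamp and the previous row's timestamp, so I converted the timestamp to epoch seconds as follows: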

data = data.withColumn('time_seconds',data.timestamp.astype('Timestamp').cast("long"))
data.show()

Next, I tried the following:

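from pyspark.sql import Window
from pyspark.sql import functions as F

# Empty partitionBy(): all rows fall into one window, ordered by session_id
my_window = Window.partitionBy().orderBy("session_id")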

data = data.withColumn("prev_value", F.lag(data.time_seconds).over(my_window))
data = data.withColumn("diff", F.when(F.isnull(data.time_seconds - data.prev_value), 0)
                              .otherwise(data.time_seconds - data.prev_value))

data.show()

This is what I got:

+----------+-----------+---------+--------+------------+----------+--------+
|session_id|  timestamp|  item_id|category|time_seconds|prev_value|    diff|
+----------+-----------+---------+--------+------------+----------+--------+
|         1|2014-04-07 |214536502|       0|  1396831869|      null|       0|
|         1|2014-04-07 |214536500|       0|  1396832049|1396831869|     180|
|         1|2014-04-07 |214536506|       0|  1396832086|1396832049|      37|
|         1|2014-04-07 |214577561|       0|  1396832220|1396832086|     134|
|  10000001|2014-09-08 |214854230|       S|  1410136538|1396832220|13304318|
|  10000001|2014-09-08 |214556216|       S|  1410136820|1410136538|     282|
|  10000001|2014-09-08 |214556212|       S|  1410136836|1410136820|      16|
|  10000001|2014-09-08 |214854230|       S|  1410136872|1410136836|      36|
|  10000001|2014-09-08 |214854125|       S|  1410137314|1410136872|     442|
|  10000002|2014-09-08 |214849322|       S|  1410167451|1410137314|   30137|
|  10000002|2014-09-08 |214838094|       S|  1410167611|1410167451|     160|
|  10000002|2014-09-08 |214714721|       S|  1410167694|1410167611|      83|
|  10000002|2014-09-08 |214853711|       S|  1410168818|1410167694|    1124|
|  10000003|2014-09-05 |214853090|       3|  1409880735|1410168818| -288083|
|  10000003|2014-09-05 |214851326|       3|  1409880865|1409880735|     130|
|  10000003|2014-09-05 |214853094|       3|  1409881043|1409880865|     178|
|  10000004|2014-09-05 |214853090|       3|  1409886885|1409881043|    5842|
|  10000004|2014-09-05 |214851326|       3|  1409889318|1409886885|    2433|
|  10000004|2014-09-05 |214853090|       3|  1409889388|1409889318|      70|
|  10000004|2014-09-05 |214851326|       3|  1409889428|1409889388|      40|
+----------+-----------+---------+--------+------------+----------+--------+
only showing top 20 rows

I expected the session IDs to come out in numeric order, not in the order shown in this output.

Is there a way to make the session IDs appear in numeric order (1, 2, 3, ...) rather than (1, 10000001, ...)? Thanks, everyone.
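
One possible explanation for the ordering above, offered as an assumption because the schema is not shown in the question: if session_id was read as a string column, sorting compares the values character by character, which puts "10000001" before "2". The plain-Python sketch below uses made-up ID values just to illustrate that behaviour.

```python
# Hypothetical session ids, stored as text (assumption: the column is a string).
ids_as_text = ["1", "2", "3", "10000001", "10000002"]

# String comparison works character by character, so "10000001" sorts before "2".
text_order = sorted(ids_as_text)

# Converting to int before comparing gives the numeric order asked for.
numeric_order = sorted(ids_as_text, key=int)

print(text_order)
print(numeric_order)
```

If data.printSchema() reports session_id as string, this would account for the 1, 10000001, ... sequence.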

I tried changing the partitionBy and orderBy values, but that did not help.

Comment: Do you want to partitionBy session_id or orderBy session_id?

Reply: orderBy session_id