SQL or PySpark - get the last time a column had a different value for each ID

I'm using PySpark, so I've tried both PySpark code and SQL.

I'm trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the following table as an example:

    +---+-------+-------+----+
    | ID|USER_ID|ADDRESS|TIME|
    +---+-------+-------+----+
    |  1|      1|      A|  10|
    |  2|      1|      B|  15|
    |  3|      1|      A|  20|
    |  4|      1|      A|  40|
    |  5|      1|      A|  45|
    +---+-------+-------+----+
The new column I want, with the correct values, would look like this:

    +---+-------+-------+----+---------+
    | ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
    +---+-------+-------+----+---------+
    |  1|      1|      A|  10|     null|
    |  2|      1|      B|  15|       10|
    |  3|      1|      A|  20|       15|
    |  4|      1|      A|  40|       15|
    |  5|      1|      A|  45|       15|
    +---+-------+-------+----+---------+

I have tried using different windows, but none of them seem to do exactly what I need. Any ideas?
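
For completeness, here is a minimal sketch of how the example DataFrame `df` used in the answers below could be built (it assumes an existing SparkSession named `spark`; the setup is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data matching the table above
df = spark.createDataFrame(
    [(1, 1, 'A', 10),
     (2, 1, 'B', 15),
     (3, 1, 'A', 20),
     (4, 1, 'A', 40),
     (5, 1, 'A', 45)],
    ['ID', 'USER_ID', 'ADDRESS', 'TIME'])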

One way using two Window specs:

from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window

w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')

# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))

# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))

df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+                               
|USER_ID|  g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
|      1|  1|  1|      A|  10|  10|     null|
|      1|  2|  2|      B|  15|  15|       10|
|      1|  3|  3|      A|  20| 105|       15|
|      1|  3|  4|      A|  40| 105|       15|
|      1|  3|  5|      A|  45| 105|       15|
+-------+---+---+-------+----+----+---------+

df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')
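
Since the question also asks for SQL, a rough Spark SQL equivalent of the same two-step idea might look like the following sketch (it assumes `df` has been registered as a temporary view named `t` via `df.createOrReplaceTempView('t')`; the view name and aliases are illustrative):

-- chg flags rows where ADDRESS differs from the previous row;
-- the running SUM turns it into a sub-group label g.
WITH flagged AS (
    SELECT *,
           CASE WHEN ADDRESS = LAG(ADDRESS) OVER (PARTITION BY USER_ID ORDER BY ID)
                THEN 0 ELSE 1 END AS chg
    FROM t
),
labelled AS (
    SELECT *, SUM(chg) OVER (PARTITION BY USER_ID ORDER BY ID) AS g
    FROM flagged
),
agg AS (
    -- aggregate per sub-group, then take the previous group's value
    SELECT USER_ID, g,
           LAG(SUM(TIME)) OVER (PARTITION BY USER_ID ORDER BY g) AS last_diff
    FROM labelled
    GROUP BY USER_ID, g
)
SELECT l.ID, l.USER_ID, l.ADDRESS, l.TIME, a.last_diff
FROM labelled l
JOIN agg a
  ON l.USER_ID = a.USER_ID AND l.g = a.g
ORDER BY l.ID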

A simplified version of @jxc's answer:

from pyspark.sql.functions import *
from pyspark.sql import Window
#Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
#Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp',sum(when(lag(col('address')).over(w) == col('address'),0).otherwise(1)).over(w)) \
           .withColumn('prev_time',lag(col('time')).over(w))
#Window definition with groups
w_grp = Window.partitionBy(col('user_id'),col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time',min(col('prev_time')).over(w_grp)).show()
  • Use lag and a running sum to assign a group label whenever the column value changes (based on the defined window). Also grab the time from the previous row; it is used in the next step.
  • Once the groups are assigned, use a running minimum to get the last timestamp at which the column value changed. (I suggest inspecting the intermediate results to better understand the transformations.)
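
To get output in the exact shape requested in the question (a single LAST_DIFF column), the helper columns can be dropped and the running minimum renamed. A small sketch that continues directly from grp_df and w_grp defined above (the final column selection is illustrative):

# Running minimum of prev_time within each (user_id, grp) gives the last change time
result = grp_df.withColumn('LAST_DIFF', min(col('prev_time')).over(w_grp)) \
               .select('ID', 'USER_ID', 'ADDRESS', 'TIME', 'LAST_DIFF') \
               .orderBy('ID')
result.show()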

Which column? How is the row order specified? Please edit the question to clarify what the last different address should be compared against.