SQL or PySpark - get the last time a column had a different value for each ID

I'm using PySpark, so I've tried both PySpark code and SQL.

I'm trying to get the last time the ADDRESS column had a different value, grouped by USER_ID. The rows are ordered by TIME. Take the following table as an example:

    +---+-------+-------+----+
    | ID|USER_ID|ADDRESS|TIME|
    +---+-------+-------+----+
    |  1|      1|      A|  10|
    |  2|      1|      B|  15|
    |  3|      1|      A|  20|
    |  4|      1|      A|  40|
    |  5|      1|      A|  45|
    +---+-------+-------+----+
The new column I want, with the correct values, would look like this:

    +---+-------+-------+----+---------+
    | ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
    +---+-------+-------+----+---------+
    |  1|      1|      A|  10|     null|
    |  2|      1|      B|  15|       10|
    |  3|      1|      A|  20|       15|
    |  4|      1|      A|  40|       15|
    |  5|      1|      A|  45|       15|
    +---+-------+-------+----+---------+

I have tried using different windows, but none of them seem to do exactly what I need. Any ideas?
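
For completeness, here is a minimal sketch of how the example DataFrame `df` used in the answers below could be built (it assumes an existing SparkSession named `spark`; the setup is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data matching the table above
df = spark.createDataFrame(
    [(1, 1, 'A', 10),
     (2, 1, 'B', 15),
     (3, 1, 'A', 20),
     (4, 1, 'A', 40),
     (5, 1, 'A', 45)],
    ['ID', 'USER_ID', 'ADDRESS', 'TIME'])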

One way using two Window specs:

from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window

w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')

# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))

# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))

df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+                               
|USER_ID|  g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
|      1|  1|  1|      A|  10|  10|     null|
|      1|  2|  2|      B|  15|  15|       10|
|      1|  3|  3|      A|  20| 105|       15|
|      1|  3|  4|      A|  40| 105|       15|
|      1|  3|  5|      A|  45| 105|       15|
+-------+---+---+-------+----+----+---------+

df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')
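
Since the question also asks for SQL, a rough Spark SQL equivalent of the same two-step idea might look like the following sketch (it assumes `df` has been registered as a temporary view named `t` via `df.createOrReplaceTempView('t')`; the view name and aliases are illustrative):

-- chg flags rows where ADDRESS differs from the previous row;
-- the running SUM turns it into a sub-group label g.
WITH flagged AS (
    SELECT *,
           CASE WHEN ADDRESS = LAG(ADDRESS) OVER (PARTITION BY USER_ID ORDER BY ID)
                THEN 0 ELSE 1 END AS chg
    FROM t
),
labelled AS (
    SELECT *, SUM(chg) OVER (PARTITION BY USER_ID ORDER BY ID) AS g
    FROM flagged
),
agg AS (
    -- aggregate per sub-group, then take the previous group's value
    SELECT USER_ID, g,
           LAG(SUM(TIME)) OVER (PARTITION BY USER_ID ORDER BY g) AS last_diff
    FROM labelled
    GROUP BY USER_ID, g
)
SELECT l.ID, l.USER_ID, l.ADDRESS, l.TIME, a.last_diff
FROM labelled l
JOIN agg a
  ON l.USER_ID = a.USER_ID AND l.g = a.g
ORDER BY l.ID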

A simplified version of @jxc's answer:

from pyspark.sql.functions import *
from pyspark.sql import Window
#Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
#Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp',sum(when(lag(col('address')).over(w) == col('address'),0).otherwise(1)).over(w)) \
           .withColumn('prev_time',lag(col('time')).over(w))
#Window definition with groups
w_grp = Window.partitionBy(col('user_id'),col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time',min(col('prev_time')).over(w_grp)).show()
  • Use lag and a running sum to assign a group label whenever the column value changes (based on the defined window). Also grab the time from the previous row; it is used in the next step.
  • Once the groups are assigned, use a running minimum to get the last timestamp at which the column value changed. (I suggest inspecting the intermediate results to better understand the transformations.)
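
To get output in the exact shape requested in the question (a single LAST_DIFF column), the helper columns can be dropped and the running minimum renamed. A small sketch that continues directly from grp_df and w_grp defined above (the final column selection is illustrative):

# Running minimum of prev_time within each (user_id, grp) gives the last change time
result = grp_df.withColumn('LAST_DIFF', min(col('prev_time')).over(w_grp)) \
               .select('ID', 'USER_ID', 'ADDRESS', 'TIME', 'LAST_DIFF') \
               .orderBy('ID')
result.show()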

Which column? How is the row order specified? Please edit the question to clarify what the last different address should be compared against.