Apache Spark / PySpark: stitching multiple event rows together in windows
I am trying to stitch together several event rows in a DataFrame based on the time difference between them. I created a new column in the DataFrame that uses lag to hold the time difference from the previous row. The DataFrame looks like this:
sc=spark.sparkContext
df = spark.createDataFrame(
sc.parallelize(
[['x',1, "9999"], ['x',2, "120"], ['x',3, "102"], ['x',4, "3000"],['x',5, "299"],['x',6, "100"]]
),
['id',"row_number", "time_diff"]
)
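I want to stitch rows together if the time difference from the previous event is less than 160.
To do this, I plan to assign the same new row number to all events that fall within 160 of each other, and then use groupby on the new row number.
For the DataFrame above, I expect this output: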
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
I wrote the following program:
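from pyspark.sql import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import when, col

window = Window.partitionBy('id').orderBy('row_number')
# Start from a copy of row_number, then try to overwrite it with the lagged value
df2 = df.withColumn('new_row_number', col('row_number'))
df3 = df2.withColumn('new_row_number', when(col('time_diff') >= 160, col('row_number'))\
.otherwise(f.lag(col('new_row_number')).over(window)))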
But the result I get is:
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 2|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
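Can someone help me solve this?
Thanks

You need the previous value of the column you are currently populating, which is not possible: lag reads the column as it exists in the input DataFrame, not the values being computed. To get the same effect, we can do the following: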
from pyspark.sql import Window
import pyspark.sql.functions as f

window = Window.partitionBy('id').orderBy('row_number')
df3=df.withColumn('new_row_number', f.when(f.col('time_diff')>=160, f.col('row_number')))\
.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
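To see why the lag attempt in the question leaves row 3 with the value 2, it may help to model both approaches in plain Python. This is only a sketch of the logic, not Spark code: `lag_attempt` and `forward_fill` are hypothetical helper names, and the lists stand in for a single `id` partition already ordered by `row_number`, with `time_diff` assumed to be numeric.

```python
# Plain-Python model of one partition, ordered by row_number.
row_numbers = [1, 2, 3, 4, 5, 6]
time_diffs = [9999, 120, 102, 3000, 299, 100]
THRESHOLD = 160


def lag_attempt(row_numbers, time_diffs):
    """Model of the question's attempt: lag reads the *input* column.

    The input new_row_number column is just a copy of row_number, so the
    lagged value is the previous row's row_number, not its computed group.
    """
    input_col = list(row_numbers)  # new_row_number before it is overwritten
    result = []
    for i, diff in enumerate(time_diffs):
        if diff >= THRESHOLD:
            result.append(row_numbers[i])
        else:
            result.append(input_col[i - 1] if i > 0 else None)
    return result


def forward_fill(row_numbers, time_diffs):
    """Model of the fix: mark group starts, then carry the last mark forward."""
    result = []
    last_seen = None
    for rn, diff in zip(row_numbers, time_diffs):
        if diff >= THRESHOLD:
            last_seen = rn  # this row starts a new group
        result.append(last_seen)
    return result


lag_result = lag_attempt(row_numbers, time_diffs)
fill_result = forward_fill(row_numbers, time_diffs)
print(lag_result)   # [1, 1, 2, 4, 5, 5]
print(fill_result)  # [1, 1, 1, 4, 5, 5]
```

Row 3 gets 2 in the first model because the lagged value comes from the untouched copy of `row_number`, while the second model carries the most recent group start forward, which is what `last(..., ignorenulls=True)` over the ordered window does.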
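Explanation:
First, we set new_row_number to row_number for every row whose time_diff is at least 160; all other rows get null.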
df2=df.withColumn('new_row_number', f.when(f.col('time_diff')>=160, f.col('row_number')))
df2.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| null|
| x| 3| 102| null|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| null|
+---+----------+---------+--------------+
Then we fill each null with the last non-null value that precedes it in the window:
df3=df2.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()
+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
| x| 1| 9999| 1|
| x| 2| 120| 1|
| x| 3| 102| 1|
| x| 4| 3000| 4|
| x| 5| 299| 5|
| x| 6| 100| 5|
+---+----------+---------+--------------+
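The question mentions grouping on new_row_number once it has been assigned. A minimal plain-Python sketch of that last step follows, using the filled rows from the table above. The `stitch` helper and the choice of summary fields (first row, last row, event count) are assumptions for illustration; in PySpark the corresponding step would presumably be a `groupBy('id', 'new_row_number')` followed by whatever aggregation the stitched event needs.

```python
# Rows as (id, row_number, time_diff, new_row_number), taken from the filled table.
rows = [
    ("x", 1, 9999, 1),
    ("x", 2, 120, 1),
    ("x", 3, 102, 1),
    ("x", 4, 3000, 4),
    ("x", 5, 299, 5),
    ("x", 6, 100, 5),
]


def stitch(rows):
    """Group rows by (id, new_row_number) and summarize each group.

    Returns tuples of (id, new_row_number, first_row, last_row, event_count).
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for id_, row_number, _time_diff, new_row_number in rows:
        groups.setdefault((id_, new_row_number), []).append(row_number)
    return [
        (key[0], key[1], members[0], members[-1], len(members))
        for key, members in groups.items()
    ]


stitched = stitch(rows)
for event in stitched:
    print(event)
# ('x', 1, 1, 3, 3)
# ('x', 4, 4, 4, 1)
# ('x', 5, 5, 6, 2)
```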
Hope this solves your problem.

Thank you very much. This helped.