PySpark: stitching multiple event rows within windows

I am trying to stitch several event rows in a dataframe together based on the time difference between them. I created a new column in the dataframe that holds the time difference to the previous row, computed with lag. The dataframe is built as follows:

sc = spark.sparkContext
df = spark.createDataFrame(
    sc.parallelize(
        [['x', 1, "9999"], ['x', 2, "120"], ['x', 3, "102"],
         ['x', 4, "3000"], ['x', 5, "299"], ['x', 6, "100"]]
    ),
    ['id', 'row_number', 'time_diff']
)
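For context, the post starts from precomputed diffs; a column like time_diff is typically derived with lag, as the question describes. A minimal sketch of that derivation, assuming a hypothetical raw_df with a numeric event_time column (neither appears in the original post):

import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Hypothetical illustration: raw_df and event_time are assumptions.
# Each row gets the gap to the previous event within the same id.
w = Window.partitionBy('id').orderBy('event_time')
df_with_diff = raw_df.withColumn(
    'time_diff',
    f.col('event_time') - f.lag('event_time').over(w)
)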
I want to stitch rows together whenever the time difference to the previous event is less than 160. For this I was planning to assign a new row number to all events that fall within 160 of each other, and later group by that new row number.

For the above dataframe, I wanted the output to be:

+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
I wrote the following program:

import pyspark.sql.functions as f
from pyspark.sql.functions import when, col
from pyspark.sql.window import Window

window = Window.partitionBy('id').orderBy('row_number')

df2 = df.withColumn('new_row_number', col('row_number'))
df3 = df2.withColumn(
    'new_row_number',
    when(col('time_diff') >= 160, col('row_number'))
    .otherwise(f.lag(col('new_row_number')).over(window))
)
But the result I am getting is:

+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             2|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
Can someone please help me resolve this? Thanks.

So you need the previous value of the very column you are currently populating, which is not possible; to achieve this we can instead do the following:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

window = Window.partitionBy('id').orderBy('row_number')
df3 = df.withColumn('new_row_number', f.when(f.col('time_diff') >= 160, f.col('row_number'))) \
        .withColumn('new_row_number', f.last(f.col('new_row_number'), ignorenulls=True).over(window))

+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
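A side note that is not in the original answer: the forward fill works because of the window frame. When an orderBy is present and no frame is given, Spark defaults to a growing frame from the start of the partition to the current row; spelling it out behaves identically here, since row_number is unique:

from pyspark.sql.window import Window

# Explicit equivalent of the default growing frame Spark applies when an
# orderBy is specified without a frame clause.
window = (
    Window.partitionBy('id')
          .orderBy('row_number')
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)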
Explanation:

First, we emit the row number for every row whose time_diff is at least 160, and null for everything else:

df2=df.withColumn('new_row_number', f.when(f.col('time_diff')>=160, f.col('row_number')))
df2.show()

+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|          null|
|  x|         3|      102|          null|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|          null|
+---+----------+---------+--------------+
Then we forward-fill the nulls with the last non-null value seen so far in the window:

df3=df2.withColumn("new_row_number", f.last(f.col("new_row_number"), ignorenulls=True).over(window))
df3.show()

+---+----------+---------+--------------+
| id|row_number|time_diff|new_row_number|
+---+----------+---------+--------------+
|  x|         1|     9999|             1|
|  x|         2|      120|             1|
|  x|         3|      102|             1|
|  x|         4|     3000|             4|
|  x|         5|      299|             5|
|  x|         6|      100|             5|
+---+----------+---------+--------------+
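To finish the stitching described in the question, the groups can then be collapsed with a groupBy on the new row number. A minimal sketch; the aggregations shown (event count and summed time_diff) are illustrative assumptions, since the post does not say how the stitched rows should be combined:

import pyspark.sql.functions as f

# Collapse each stitched group into one row. The aggregations below are
# illustrative assumptions, not specified in the original post.
stitched = (
    df3.groupBy('id', 'new_row_number')
       .agg(
           f.count('*').alias('event_count'),
           f.sum(f.col('time_diff').cast('int')).alias('total_time_diff'),
       )
)
stitched.show()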

Hope this solves your problem.

Thanks a lot, this helped.