Python 计算跨数据帧的重叠时间间隔

Python 计算跨数据帧的重叠时间间隔,python,pyspark,Python,Pyspark,我试图计算哪些行在时间上与数据帧上的任何其他行重叠。举个例子: df = spark.createDataFrame( [ [1,"2020:01:01 12:00", "2020:01:01 13:00"], # T [1,"2020:01:01 12:30", "2020:01:01 13:00"],

我试图计算哪些行在时间上与
数据帧上的任何其他行重叠。举个例子:

            df = spark.createDataFrame(
            [
                [1,"2020:01:01 12:00", "2020:01:01 13:00"], # T
                [1,"2020:01:01 12:30", "2020:01:01 13:00"], # T
                [1,"2020:01:01 14:00", "2020:01:01 15:00"], # F
                [2,"2020:01:01 09:00", "2020:01:01 13:00"], # F
                [2,"2020:01:01 18:00", "2020:01:01 19:00"], # F
            ],
            ["id", "start", "end"]
        )
        w = Window.partitionBy("id").orderBy("start").rowsBetween(Window.unboundedPreceding, 0)
        df.withColumn("OVERLAPS", ?????.over(w)).show()
我想生成一列,每当具有相同
id
的两行在时间上重叠时发出信号:

                [1,"2020:01:01 12:00", "2020:01:01 13:00", True], 
                [1,"2020:01:01 12:30", "2020:01:01 13:00", True], 
                [1,"2020:01:01 14:00", "2020:01:01 15:00", False], 
                [2,"2020:01:01 09:00", "2020:01:01 13:00", False], 
                [2,"2020:01:01 18:00", "2020:01:01 19:00", False],
理想情况下,我可以创建时间间隔,在pandas中,它将使用
pd.IntervalIndex
,我可以查询时间间隔,例如[2020:01:01 12:30,2020:01:01 14:00]中的
“2020:01:01 13:00”


但我一直在努力在spark上实现它。

所以我尝试了一下,在scala中,我的解决方案很难看,可能会改进(整个专栏的重命名等).我做了一个假设,就是时间已经被解析成一些整数格式,然后我用一个稍微复杂一点的连接基本上解决了这个问题:

val s = Seq(
    (1, 1200, 1300),
    (1, 1230, 1300),
    (1, 1400, 1500),
    (2, 900, 1300),
    (2, 1800, 1900)
)

val df = s.toDF("id", "start", "end").withColumn("i", monotonicallyIncreasingId)
val dfr = df.select($"id", $"start".alias("start_r"), $"end".alias("end_r"), $"i".alias("i_r")) 

val overlaps = df.join(
    dfr,
    df("id") === dfr("id") 
        and df("start") <= dfr("end_r") 
        and df("end") >= dfr("start_r") 
        and (df("i") !== dfr("i_r"))
    ).select($"i".alias("i_overlaps"), lit(true).alias("overlaps"))

val result = df.join(overlaps, df("i") === overlaps("i_overlaps"), "left").
        drop("i_overlaps").
        withColumn("overlaps", when($"overlaps".isNull, false).otherwise($"overlaps"))

result.show
您可以再次删除
i
列。在此解决方案中,也不需要重新分区或对数据帧排序,因为这将在联接中处理。不过,这种联接可能不是最有效的:)

+---+-----+----+---+--------+                                                   
| id|start| end|  i|overlaps|
+---+-----+----+---+--------+
|  1| 1200|1300|  0|    true|
|  1| 1230|1300|  1|    true|
|  2|  900|1300|  3|   false|
|  1| 1400|1500|  2|   false|
|  2| 1800|1900|  4|   false|
+---+-----+----+---+--------+