Python 使用Pyspark根据条件和大数据计算时差

Python 使用Pyspark根据条件和大数据计算时差,python,dataframe,apache-spark,pyspark,apache-spark-sql,Python,Dataframe,Apache Spark,Pyspark,Apache Spark Sql,请帮忙……我有如下数据: from pyspark.context import SparkContext from pyspark.sql.session import SparkSession sc = SparkContext.getOrCreate() spark = SparkSession(sc) from pyspark.sql.functions import substring, length dept = [("A",1,"2020-11-07

请帮忙……我有如下数据:

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql.functions import substring, length
dept = [("A",1,"2020-11-07 23:19:12"), ("A",1,"2020-11-07 23:19:16"), ("A",1,"2020-11-07 23:19:56"), ("A",0,"2020-11-07 23:20:37"), ("A",0,"2020-11-07 23:21:06"), ("A",0,"2020-11-07 23:21:47"), ("A",1,"2020-11-07 23:22:05"), ("A",1,"2020-11-07 23:22:30"),("A",1,"2020-11-07 23:23:00"), ("B",1,"2020-11-07 22:19:12"), ("B",1,"2020-11-07 22:20:10"), ("B",0,"2020-11-07 22:21:31"), ("B",0,"2020-11-07 22:22:01"), ("B",0,"2020-11-07 22:22:45"), ("B",1,"2020-11-07 22:23:52"), ("B",1,"2020-11-07 22:24:10")]
deptColumns = ["Id","BAP","Time"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.show()

  • 使用Pyspark,如何为每个ID获取第一系列零中的第一个0的时间,并在同一系列零之后获取第一个1的时间。然后在两者之间进行时间戳减法。大概是这样的:
  • 必须对构成每个ID的每个零序列执行此操作。因此,如果有多个零序列,则可以对同一ID使用多个DeltaTime

    实际上,我可以计算连续行之间的增量时间:

    Delta=deptDF.withColumn("DeltaTime",(deptDF.Time.cast("bigint") - lag(deptDF.Time.cast("bigint"),1).over(Window.partitionBy("Id").orderBy("Time")).cast("bigint")))
    Delta.show()
    

    是否可以添加任何条件以获得预期结果?

    添加两列begin0和begin1,以帮助使用窗口函数解析数据:

    import pyspark.sql.functions as F
    
    window = Window.partitionBy('Id').orderBy('Time')
    Delta = deptDF.withColumn(
        'begin0',
        (F.lag('BAP').over(window) != 0) & (F.col('BAP') == 0)
    ).withColumn(
        'begin1',
        (F.lag('BAP').over(window) == 0) & (F.col('BAP') == 1)
    ).filter(
        'begin0 or begin1'
    ).withColumn(
        'DeltaTime',
        F.when(
            F.col('BAP') == 0,
            F.date_format(
                (
                    F.lead('Time').over(window).cast('timestamp').cast('bigint') -
                    F.col('Time').cast('timestamp').cast('bigint')
                ).cast('timestamp'),
               'HH:mm:ss'
           )
        ).otherwise(
            F.lit('00:00:00')
        )
    ).drop(
        'begin0', 'begin1'
    ).orderBy(
        'Id','Time'
    )
    
    Delta.show()
    +---+---+-------------------+---------+
    | Id|BAP|               Time|DeltaTime|
    +---+---+-------------------+---------+
    |  A|  0|2020-11-07 23:20:37| 00:01:28|
    |  A|  1|2020-11-07 23:22:05| 00:00:00|
    |  B|  0|2020-11-07 22:21:31| 00:02:21|
    |  B|  1|2020-11-07 22:23:52| 00:00:00|
    +---+---+-------------------+---------+
    

    欢迎来到SO。请展示你的努力。所以这不是一个工作卸载系统。后期编辑。。检查新版本