PySpark CDC (change data capture)

I am trying to write PySpark code that handles two scenarios.

Scenario 1:

Input data:

col1|col2|date
100|Austin|2021-01-10
100|Newyork|2021-02-15
100|Austin|2021-03-02
Expected CDC output:

col1|col2|start_date|end_date
100|Austin|2021-01-10|2021-02-15
100|Newyork|2021-02-15|2021-03-02
100|Austin|2021-03-02|2099-12-31
In date order, the col2 value changes between records, and I want to keep that change history (CDC).

Scenario 2:

Input:

col1|col2|date
100|Austin|2021-01-10
100|Austin|2021-03-02  -> I want to eliminate this version because there is no change in col1 and col2 values between records. 
Expected output:

 col1|col2|start_date|end_date
 100|Austin|2021-01-10|2099-12-31
I want to use the same code for both scenarios.

I am trying something like the following, but it does not work for both scenarios:

    inputdf = inputdf.groupBy('col1', 'col2', 'date').agg(
        F.min("date").alias("r_date"))
    inputdf = inputdf.drop("date").withColumnRenamed("r_date", "start_date")
    my_allcolumnwindowasc = Window.partitionBy('col1', 'col2').orderBy("start_date")
    inputdf = inputdf.withColumn('dropDuplicates', F.lead(inputdf.start_date).over(my_allcolumnwindowasc)) \
        .where(F.col("dropDuplicates").isNotNull()).drop('dropDuplicates')
In some scenarios there are more than 20 columns. Thanks for the help.

Check this out.

Steps:

  • Use a window function to assign row numbers
  • Register the dataframe as a temporary view
  • Use a self-join (the join condition check is the key part)
  • Use the lead window function wrapped in coalesce to supply the "2099-12-31" value when the next start_date is null

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    spark = SparkSession \
        .builder \
        .appName("SO") \
        .getOrCreate()
    
    df = spark.createDataFrame(
        [(100, "Austin", "2021-01-10"),
         (100, "Newyork", "2021-02-15"),
         (100, "Austin", "2021-03-02"),
         ],
        ['col1', 'col2', 'date']
    )
    
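    # Scenario 2 input - uncomment this block (and comment out the one above) to test the no-change case: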
    # df = spark.createDataFrame(
    #     [(100, "Austin", "2021-01-10"),
    #      (100, "Austin", "2021-03-02"),
    #      ],
    #     ['col1', 'col2', 'date']
    # )
    
    df1 = df.withColumn("start_date", F.to_date("date"))
    
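    # Window over each col1 key, ordered by start_date; reused for row_number() and for lead().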
    w = Window.partitionBy("col1").orderBy("start_date")
    
    df_1 = df1.withColumn("rn", F.row_number().over(w))
    
    df_1.createTempView("temp_1")
    
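    # Self-join each row (rn) to the previous row (rn - 1) within the same col1;
    # if col2 is unchanged from that previous row, flag the current row as 'delete'.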
    df_dupe = spark.sql("""
        select temp_1.col1, temp_1.col2, temp_1.start_date,
               case when temp_1.col1 = temp_2.col1 and temp_1.col2 = temp_2.col2
                    then 'delete' else 'no-delete' end as dupe
        from temp_1
        left join temp_1 as temp_2
          on temp_1.col1 = temp_2.col1
         and temp_1.col2 = temp_2.col2
         and temp_1.rn - 1 = temp_2.rn
        order by temp_1.start_date
    """)
    
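    # Keep only the changed rows, then close each version with the next start_date,
    # falling back to the open-ended 2099-12-31 for the current version.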
    df_dupe.filter(F.col("dupe")=="no-delete").drop("dupe")\
        .withColumn("end_date", F.coalesce(F.lead("start_date").over(w),F.lit("2099-12-31"))).show()
    
    
    # Result:
    # Scenario 1:
    # +----+-------+----------+----------+
    # |col1|   col2|start_date|  end_date|
    # +----+-------+----------+----------+
    # | 100| Austin|2021-01-10|2021-02-15|
    # | 100|Newyork|2021-02-15|2021-03-02|
    # | 100| Austin|2021-03-02|2099-12-31|
    # +----+-------+----------+----------+
    #
    # Scenario 2:
    # +----+------+----------+----------+
    # |col1|  col2|start_date|  end_date|
    # +----+------+----------+----------+
    # | 100|Austin|2021-01-10|2099-12-31|
    # +----+------+----------+----------+
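
Since the question mentions tables with 20+ columns, here is a minimal sketch of an alternative (not part of the answer above) that does the change detection purely with DataFrame functions, comparing each row to the previous one via lag over a struct of the tracked columns. The names tracked, w_key and df2 are made up for illustration; tracked would hold the real attribute columns, and df1 is the dataframe built above with start_date already cast to a date.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Hypothetical list of tracked attribute columns - replace with the real 20+ columns.
    tracked = ["col2"]

    w_key = Window.partitionBy("col1").orderBy("start_date")

    df2 = (
        df1
        # Previous row's tracked values, bundled into one struct for a single comparison.
        .withColumn("prev", F.lag(F.struct(*tracked)).over(w_key))
        # Keep the first row per key and any row whose tracked values changed.
        .filter(F.col("prev").isNull() | (F.struct(*tracked) != F.col("prev")))
        .drop("prev")
        # Close each retained version with the next version's start_date, open-ended otherwise.
        .withColumn(
            "end_date",
            F.coalesce(F.lead("start_date").over(w_key), F.to_date(F.lit("2099-12-31"))),
        )
    )

    df2.select("col1", *tracked, "start_date", "end_date").show()

With the scenario 1 data this keeps all three versions, and with the scenario 2 data only the first row survives, which should match the expected outputs above.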