PySpark CDC (change data capture)
I am trying to write PySpark code that handles two scenarios.
Scenario 1:
Input data:
col1|col2|date
100|Austin|2021-01-10
100|Newyork|2021-02-15
100|Austin|2021-03-02
Expected CDC output:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2021-02-15
100|Newyork|2021-02-15|2021-03-02
100|Austin|2021-03-02|2099-12-31
Ordered by date, the col2 value changes between records, and I want to keep that CDC history.
Scenario 2:
Input:
col1|col2|date
100|Austin|2021-01-10
100|Austin|2021-03-02 -> I want to eliminate this version because there is no change in col1 and col2 values between records.
Expected output:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2099-12-31
I want to use the same code for both scenarios.
I am trying something like the following, but it does not work for both cases:
inputdf = inputdf.groupBy('col1', 'col2', 'date').agg(
    F.min("date").alias("r_date"))
inputdf = inputdf.drop("date").withColumnRenamed("r_date", "start_date")
my_allcolumnwindowasc = Window.partitionBy('col1', 'col2').orderBy("start_date")
inputdf = inputdf.withColumn(
    'dropDuplicates',
    F.lead(inputdf.start_date).over(my_allcolumnwindowasc)
).where(F.col("dropDuplicates").isNotNull()).drop('dropDuplicates')
In some scenarios there are more than 20 columns.
Thanks for your help.

Check this out.
Steps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession \
    .builder \
    .appName("SO") \
    .getOrCreate()
df = spark.createDataFrame(
    [(100, "Austin", "2021-01-10"),
     (100, "Newyork", "2021-02-15"),
     (100, "Austin", "2021-03-02"),
     ],
    ['col1', 'col2', 'date']
)
# Scenario 2 input: uncomment to test the second case
# df = spark.createDataFrame(
#     [(100, "Austin", "2021-01-10"),
#      (100, "Austin", "2021-03-02"),
#      ],
#     ['col1', 'col2', 'date']
# )
# Cast the date and number the rows per key in date order
df1 = df.withColumn("start_date", F.to_date("date"))
w = Window.partitionBy("col1").orderBy("start_date")
df_1 = df1.withColumn("rn", F.row_number().over(w))
df_1.createTempView("temp_1")
# Self-join each row to the previous row (rn - 1); unchanged col1/col2 marks a duplicate version
df_dupe = spark.sql('select temp_1.col1, temp_1.col2, temp_1.start_date, case when temp_1.col1 = temp_2.col1 and temp_1.col2 = temp_2.col2 then "delete" else "no-delete" end as dupe from temp_1 left join temp_1 as temp_2 '
                    'on temp_1.col1 = temp_2.col1 and temp_1.col2 = temp_2.col2 and temp_1.rn - 1 = temp_2.rn order by temp_1.start_date')
# Keep the non-duplicates and close each record with the next start_date, or 2099-12-31 for the open record
df_dupe.filter(F.col("dupe") == "no-delete").drop("dupe") \
    .withColumn("end_date", F.coalesce(F.lead("start_date").over(w), F.lit("2099-12-31"))).show()
# Result:
# Scenario1:
#+----+-------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+-------+----------+----------+
# | 100| Austin|2021-01-10|2021-02-15|
# | 100|Newyork|2021-02-15|2021-03-02|
# | 100| Austin|2021-03-02|2099-12-31|
# +----+-------+----------+----------+
#
# Scenario 2:
# +----+------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+------+----------+----------+
# | 100|Austin|2021-01-10|2099-12-31|
# +----+------+----------+----------+
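Since the question mentions that some scenarios track 20+ columns, here is a minimal DataFrame-only sketch of the same idea, not part of the answer above: compare every tracked column to its previous value with F.lag and keep a row only when something changed. The tracked_cols list and the single col1 key are assumptions; adjust them to the real schema.

from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("SO").getOrCreate()

df = spark.createDataFrame(
    [(100, "Austin", "2021-01-10"),
     (100, "Newyork", "2021-02-15"),
     (100, "Austin", "2021-03-02")],
    ['col1', 'col2', 'date']
)

# Columns whose changes should open a new CDC version (assumed; extend to 20+ as needed)
tracked_cols = ['col2']

df1 = df.withColumn("start_date", F.to_date("date"))
w = Window.partitionBy("col1").orderBy("start_date")

# A row is "changed" when any tracked column differs from its value on the previous row,
# or when there is no previous row (first version of the key).
# eqNullSafe is used instead of == so NULLs in tracked columns still compare as equal.
changed = reduce(
    lambda acc, c: acc | ~F.lag(c).over(w).eqNullSafe(F.col(c)),
    tracked_cols,
    F.lag("start_date").over(w).isNull()
)

result = (df1.withColumn("changed", changed)
             .filter(F.col("changed"))
             .withColumn("end_date",
                         F.coalesce(F.lead("start_date").over(w), F.lit("2099-12-31")))
             .select("col1", *tracked_cols, "start_date", "end_date"))
result.show()

With the scenario 1 input this keeps all three rows; with the scenario 2 input the second Austin row is dropped and the remaining row gets end_date 2099-12-31, matching the outputs shown above.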