PySpark: generate a column's value until the next value arrives (Hive table)


I have two datasets, as shown below:

  • Dataset 1: cust_id, pt_dt (one row per customer and partition date; the full sample is reproduced in the answer code below)
  • Dataset 2: cust_id, chg_date, ins_status

    cust_id     chg_date                       ins_status
    985XFT82Y4  2020-08-24 22:12:34.332000     subscribed
    985XFT82Y4  2020-11-11 14:45:31.152000     installed
    985XFT82Y4  2021-02-02 01:26:34.500000     migration
    985XFT82Y4  2021-03-09 08:11:57.790000     setup done
    
    Now I need to join these two datasets and generate a result dataset, dataset_result, which should have the fields cust_id, pt_dt and ins_status. The join should be done on cust_id and on pt_dt/chg_date. The result should look like this:

    cust_id     pt_dt           ins_status
    985XFT82Y4  20200824        subscribed
    985XFT82Y4  20200826        subscribed
    985XFT82Y4  20200902        subscribed
    985XFT82Y4  20200918        subscribed
    985XFT82Y4  20200930        subscribed
    985XFT82Y4  20201016        subscribed
    985XFT82Y4  20201021        subscribed
    985XFT82Y4  20201102        subscribed
    985XFT82Y4  20201111        installed
    985XFT82Y4  20201112        installed
    985XFT82Y4  20201208        installed
    985XFT82Y4  20210111        installed
    985XFT82Y4  20210202        migration
    985XFT82Y4  20210303        migration
    985XFT82Y4  20210309        setup done
    985XFT82Y4  20210311        setup done
    
    I tried joining the two datasets as below, but it does not produce the result above; the equality join only matches the partition dates that fall exactly on a change date:

    select a.cust_id, a.pt_dt, b.ins_status
    from dataset1 a 
    left join dataset2 b
    on (a.cust_id = b.cust_id)
    and (a.pt_dt = regexp_replace(substr(b.chg_date,1,10), '-', ''))
    
    Can anyone suggest the best way to do this in PySpark or Hive?

    Thanks.

    The steps are as follows:

    • string --> timestamp --> to_date
    • apply the lead() function over a window spec partitioned by the id and ordered by the date
    • handle the None value that lead() leaves in the last row by filling it with today's date, then perform the join and select the relevant columns from each dataframe
    Working code:

    # parse both datasets, derive an end date per status with lead(), then range-join
    import pyspark.sql.functions as F
    from pyspark.sql import Window
    from datetime import datetime
    
    data = [("985XFT82Y4", "20200824"),
    ("985XFT82Y4", "20200826"), 
    ("985XFT82Y4", "20200902"), 
    ("985XFT82Y4", "20200918"), 
    ("985XFT82Y4", "20200930"), 
    ("985XFT82Y4", "20201016"), 
    ("985XFT82Y4", "20201021"), 
    ("985XFT82Y4", "20201102"), 
    ("985XFT82Y4", "20201111"), 
    ("985XFT82Y4", "20201112"), 
    ("985XFT82Y4", "20201208"), 
    ("985XFT82Y4", "20210111"), 
    ("985XFT82Y4", "20210202"), 
    ("985XFT82Y4", "20210303"), 
    ("985XFT82Y4", "20210309"), 
    ("985XFT82Y4", "20210311")] 
    
    # dataset1: parse the yyyyMMdd string and keep pt_dt as a yyyy-MM-dd date string
    df1 = (spark.createDataFrame(data, ["cust_id", "pt_dt"])
           .withColumn("pt_dt", F.to_timestamp("pt_dt", "yyyyMMdd"))
           .withColumn("pt_dt", F.date_format(F.col("pt_dt"), "yyyy-MM-dd")))
    df1.show()
    
    data1 = [("985XFT82Y4", "2020-08-24 22:12:34.332000",  "subscribed"),   
    ("985XFT82Y4", "2020-11-11 14:45:31.152000",  "installed"),     
    ("985XFT82Y4", "2021-02-02 01:26:34.500000",  "migration"),     
    ("985XFT82Y4", "2021-03-09 08:11:57.790000",  "setup done")]    
    # dataset2: parse the chg_date timestamps and keep only the yyyy-MM-dd part
    ts_pattern = "yyyy-MM-dd HH:mm:ss.SSSSSS"
    df2 = (spark.createDataFrame(data1, ["cust_id", "chg_date", "ins_status"])
           .withColumn("chg_date", F.to_timestamp("chg_date", ts_pattern))
           .withColumn("chg_date", F.date_format(F.col("chg_date"), "yyyy-MM-dd")))
    df2.show()
    
    # each status is valid from its chg_date up to (but excluding) the next chg_date
    window_spec = Window.partitionBy("cust_id").orderBy("chg_date")
    df2 = df2.withColumn("end_chg_date", F.lead("chg_date").over(window_spec))
    # the last status has no successor, so close its interval with today's date
    df2 = df2.withColumn("end_chg_date", F.when(F.col("end_chg_date").isNull(), F.lit(datetime.now().strftime("%Y-%m-%d"))).otherwise(F.col("end_chg_date")))
    df2.show()
    
    +----------+----------+----------+------------+
    |   cust_id|  chg_date|ins_status|end_chg_date|
    +----------+----------+----------+------------+
    |985XFT82Y4|2020-08-24|subscribed|  2021-03-16|
    |985XFT82Y4|2020-11-11| installed|  2021-02-02|
    |985XFT82Y4|2021-02-02| migration|  2021-03-09|
    |985XFT82Y4|2021-03-09|setup done|  2021-03-16|
    +----------+----------+----------+------------+
    
    # range join: pick the status whose [chg_date, end_chg_date) interval contains pt_dt
    cond = [df1["cust_id"] == df2["cust_id"], df1["pt_dt"] >= df2["chg_date"], df1["pt_dt"] < df2["end_chg_date"]]
    df3 = df1.join(df2, cond, "left").select(df1["cust_id"], df1["pt_dt"], "ins_status").orderBy("pt_dt")
    # use df1 in the select to resolve the ambiguous cust_id column
    df3.show()
    
    
    +----------+----------+----------+
    |   cust_id|     pt_dt|ins_status|
    +----------+----------+----------+
    |985XFT82Y4|2020-08-24|subscribed|
    |985XFT82Y4|2020-08-26|subscribed|
    |985XFT82Y4|2020-09-02|subscribed|
    |985XFT82Y4|2020-09-18|subscribed|
    |985XFT82Y4|2020-09-30|subscribed|
    |985XFT82Y4|2020-10-16|subscribed|
    |985XFT82Y4|2020-10-21|subscribed|
    |985XFT82Y4|2020-11-02|subscribed|
    |985XFT82Y4|2020-11-11| installed|
    |985XFT82Y4|2020-11-12| installed|
    |985XFT82Y4|2020-12-08| installed|
    |985XFT82Y4|2021-01-11| installed|
    |985XFT82Y4|2021-02-02| migration|
    |985XFT82Y4|2021-03-03| migration|
    |985XFT82Y4|2021-03-09|setup done|
    |985XFT82Y4|2021-03-11|setup done|
    +----------+----------+----------+
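    The pt_dt values above come out in yyyy-MM-dd form because df1 was reformatted earlier. If the result should show pt_dt in the original yyyyMMdd layout (as in the expected output), one extra step is enough. A minimal sketch, assuming the df3 built above:

    # hypothetical post-step: restore the yyyyMMdd layout of pt_dt
    df3_out = df3.withColumn("pt_dt", F.date_format(F.to_date("pt_dt", "yyyy-MM-dd"), "yyyyMMdd"))
    df3_out.show()

    An alternative answer does the range join directly with to_date() and lead(), leaving the last status' next_date as NULL instead of filling in today's date: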
    
    from pyspark.sql import functions as F, Window
    
    # convert pt_dt (yyyyMMdd) to a proper date, and compute each status' next change date
    df3 = df1.withColumn('pt_date', F.to_date(df1.pt_dt.cast('string'), 'yyyyMMdd'))
    df4 = df2.withColumn('next_date', F.lead('chg_date').over(Window.partitionBy('cust_id').orderBy('chg_date')))
    
    # chg_date is assumed to be a string here, so Spark implicitly casts it to a date
    # for the comparison; if it is a real timestamp, wrap it in F.to_date() first
    result = df3.join(df4, 
        (df3.cust_id == df4.cust_id) & 
        (df3.pt_date >= df4.chg_date) & 
        ((df3.pt_date < df4.next_date) | df4.next_date.isNull()), 
        'left'
    ).select(df3.cust_id, df3.pt_dt, df4.ins_status)
    
    result.show()
    +----------+--------+----------+
    |   cust_id|   pt_dt|ins_status|
    +----------+--------+----------+
    |985XFT82Y4|20200824|subscribed|
    |985XFT82Y4|20200826|subscribed|
    |985XFT82Y4|20200902|subscribed|
    |985XFT82Y4|20200918|subscribed|
    |985XFT82Y4|20200930|subscribed|
    |985XFT82Y4|20201016|subscribed|
    |985XFT82Y4|20201021|subscribed|
    |985XFT82Y4|20201102|subscribed|
    |985XFT82Y4|20201111| installed|
    |985XFT82Y4|20201112| installed|
    |985XFT82Y4|20201208| installed|
    |985XFT82Y4|20210111| installed|
    |985XFT82Y4|20210202| migration|
    |985XFT82Y4|20210303| migration|
    |985XFT82Y4|20210309|setup done|
    |985XFT82Y4|20210311|setup done|
    +----------+--------+----------+
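    Either result can then be written back to Hive. A minimal sketch, assuming the target table is named dataset_result (as in the question) and that a full overwrite is acceptable:

    # hypothetical write-back of the joined result to a Hive table
    result.write.mode("overwrite").saveAsTable("dataset_result")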