PySpark: generate a column's value for every date until the next value arrives (Hive table)
I have two datasets, as below. Dataset 1 has the columns cust_id and pt_dt (partition dates in yyyyMMdd format). Dataset 2 is:
cust_id chg_date ins_status
985XFT82Y4 2020-08-24 22:12:34.332000 subscribed
985XFT82Y4 2020-11-11 14:45:31.152000 installed
985XFT82Y4 2021-02-02 01:26:34.500000 migration
985XFT82Y4 2021-03-09 08:11:57.790000 setup done
Now I need to join these two datasets and generate a result dataset, dataset_result, with the fields cust_id, pt_dt and ins_status.
The join should be on cust_id and on pt_dt/chg_date. The result should look like this:
cust_id pt_dt ins_status
985XFT82Y4 20200824 subscribed
985XFT82Y4 20200826 subscribed
985XFT82Y4 20200902 subscribed
985XFT82Y4 20200918 subscribed
985XFT82Y4 20200930 subscribed
985XFT82Y4 20201016 subscribed
985XFT82Y4 20201021 subscribed
985XFT82Y4 20201102 subscribed
985XFT82Y4 20201111 installed
985XFT82Y4 20201112 installed
985XFT82Y4 20201208 installed
985XFT82Y4 20210111 installed
985XFT82Y4 20210202 migration
985XFT82Y4 20210303 migration
985XFT82Y4 20210309 setup done
985XFT82Y4 20210311 setup done
I have tried joining the two datasets as below, but could not achieve this:
select a.cust_id, a.pt_dt, b.ins_status
from dataset1 a
left join dataset2 b
on (a.cust_id = b.cust_id)
and (a.pt_dt = regexp_replace(substr(b.chg_date,1,10), '-', ''))
Can someone suggest the best way to do this in PySpark or Hive? Thanks.

The steps are as follows:
- convert the date strings: string --> timestamp --> to_date
- apply the lead() function over a window spec partitioned by cust_id and ordered by the change date
- lead() leaves a None in each customer's last row; fill it with today's date, then perform the join and select the relevant columns from each dataframe
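Before the full PySpark code, the as-of semantics those steps implement can be sketched in plain Python; the helper status_for and the reduced event list here are illustrative only:

```python
from bisect import bisect_right

# Status change events from dataset2 (date, status), sorted by date.
events = [("2020-08-24", "subscribed"),
          ("2020-11-11", "installed"),
          ("2021-02-02", "migration"),
          ("2021-03-09", "setup done")]
dates = [d for d, _ in events]

def status_for(pt_dt):
    """Return the status in effect on pt_dt: the latest event on or before it."""
    i = bisect_right(dates, pt_dt)
    return events[i - 1][1] if i else None

print(status_for("2020-09-02"))  # subscribed: between the first two events
print(status_for("2021-03-11"))  # setup done: after the last event
```

Each pt_dt picks up the most recent earlier status, which is exactly what the lead()-based interval join does set-wise.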
import pyspark.sql.functions as F
from pyspark.sql import Window
from datetime import datetime
data = [("985XFT82Y4", "20200824"),
        ("985XFT82Y4", "20200826"),
        ("985XFT82Y4", "20200902"),
        ("985XFT82Y4", "20200918"),
        ("985XFT82Y4", "20200930"),
        ("985XFT82Y4", "20201016"),
        ("985XFT82Y4", "20201021"),
        ("985XFT82Y4", "20201102"),
        ("985XFT82Y4", "20201111"),
        ("985XFT82Y4", "20201112"),
        ("985XFT82Y4", "20201208"),
        ("985XFT82Y4", "20210111"),
        ("985XFT82Y4", "20210202"),
        ("985XFT82Y4", "20210303"),
        ("985XFT82Y4", "20210309"),
        ("985XFT82Y4", "20210311")]
df1 = (spark.createDataFrame(data, ["cust_id", "pt_dt"])
       .withColumn("pt_dt", F.to_timestamp("pt_dt", "yyyyMMdd"))
       .withColumn("pt_dt", F.date_format(F.col("pt_dt"), "yyyy-MM-dd")))
df1.show()
data1 = [("985XFT82Y4", "2020-08-24 22:12:34.332000", "subscribed"),
         ("985XFT82Y4", "2020-11-11 14:45:31.152000", "installed"),
         ("985XFT82Y4", "2021-02-02 01:26:34.500000", "migration"),
         ("985XFT82Y4", "2021-03-09 08:11:57.790000", "setup done")]
ts_pattern = "yyyy-MM-dd HH:mm:ss.SSSSSS"
df2 = (spark.createDataFrame(data1, ["cust_id", "chg_date", "ins_status"])
       .withColumn("chg_date", F.to_timestamp("chg_date", ts_pattern))
       .withColumn("chg_date", F.date_format(F.col("chg_date"), "yyyy-MM-dd")))
df2.show()
window_spec = Window.partitionBy("cust_id").orderBy("chg_date")
df2 = df2.withColumn("end_chg_date", F.lead("chg_date").over(window_spec))
df2 = df2.withColumn("end_chg_date", F.when(F.col("end_chg_date").isNull(), F.lit(datetime.now().strftime("%Y-%m-%d"))).otherwise(F.col("end_chg_date")))
df2.show()
+----------+----------+----------+------------+
| cust_id| chg_date|ins_status|end_chg_date|
+----------+----------+----------+------------+
|985XFT82Y4|2020-08-24|subscribed| 2021-03-16|
|985XFT82Y4|2020-11-11| installed| 2021-02-02|
|985XFT82Y4|2021-02-02| migration| 2021-03-09|
|985XFT82Y4|2021-03-09|setup done| 2021-03-16|
+----------+----------+----------+------------+
cond = [df1["cust_id"] == df2["cust_id"], df1["pt_dt"] >= df2["chg_date"], df1["pt_dt"] < df2["end_chg_date"]]
df3 = df1.join(df2, cond, "left").select(df1["cust_id"], df1["pt_dt"], "ins_status").orderBy("pt_dt")
# use df1 in select to resolve same column name conflict
df3.show()
+----------+----------+----------+
| cust_id| pt_dt|ins_status|
+----------+----------+----------+
|985XFT82Y4|2020-08-24|subscribed|
|985XFT82Y4|2020-08-26|subscribed|
|985XFT82Y4|2020-09-02|subscribed|
|985XFT82Y4|2020-09-18|subscribed|
|985XFT82Y4|2020-09-30|subscribed|
|985XFT82Y4|2020-10-16|subscribed|
|985XFT82Y4|2020-10-21|subscribed|
|985XFT82Y4|2020-11-02|subscribed|
|985XFT82Y4|2020-11-11| installed|
|985XFT82Y4|2020-11-12| installed|
|985XFT82Y4|2020-12-08| installed|
|985XFT82Y4|2021-01-11| installed|
|985XFT82Y4|2021-02-02| migration|
|985XFT82Y4|2021-03-03| migration|
|985XFT82Y4|2021-03-09|setup done|
|985XFT82Y4|2021-03-11|setup done|
+----------+----------+----------+
Another answer, working from the raw tables (pt_dt as yyyyMMdd, chg_date as the original timestamp) and treating the null from lead() as an open-ended interval instead of filling in today's date:
from pyspark.sql import functions as F, Window
df3 = df1.withColumn('pt_date', F.to_date(df1.pt_dt.cast('string'), 'yyyyMMdd'))
df4 = df2.withColumn('next_date', F.lead('chg_date').over(Window.partitionBy('cust_id').orderBy('chg_date')))
result = df3.join(df4,
                  (df3.cust_id == df4.cust_id) &
                  (df3.pt_date >= df4.chg_date) &
                  ((df3.pt_date < df4.next_date) | df4.next_date.isNull()),
                  'left'
                  ).select(df3.cust_id, df3.pt_dt, df4.ins_status)
result.show()
+----------+--------+----------+
| cust_id| pt_dt|ins_status|
+----------+--------+----------+
|985XFT82Y4|20200824|subscribed|
|985XFT82Y4|20200826|subscribed|
|985XFT82Y4|20200902|subscribed|
|985XFT82Y4|20200918|subscribed|
|985XFT82Y4|20200930|subscribed|
|985XFT82Y4|20201016|subscribed|
|985XFT82Y4|20201021|subscribed|
|985XFT82Y4|20201102|subscribed|
|985XFT82Y4|20201111| installed|
|985XFT82Y4|20201112| installed|
|985XFT82Y4|20201208| installed|
|985XFT82Y4|20210111| installed|
|985XFT82Y4|20210202| migration|
|985XFT82Y4|20210303| migration|
|985XFT82Y4|20210309|setup done|
|985XFT82Y4|20210311|setup done|
+----------+--------+----------+