Finding continuous month-over-month enrollment periods in PySpark
I am trying to use a Spark DataFrame of health plan member IDs and enrollment months to identify "continuous" coverage periods, i.e. members enrolled in consecutive months. Below is a sample of the data I am working with in PySpark (sc is the SparkSession).

If I were doing this exercise in pandas, I would create the unique coverage period field with the code below. However, because of the size of the data I am working with, the solution needs to be in Spark, and based on my research so far, an iterator approach like this is not really what Spark is meant for:
a = 0
b = []
for i in df.gap.tolist():
    if i != 1:
        a += 1
        b.append(a)
    else:
        b.append(a)

df['unique_coverage_period'] = b
print(df)
# memid month_elig gap unique_coverage_period
#0 123a 2020-01-01 0.0 1
#1 123a 2020-02-01 1.0 1
#2 123a 2020-03-01 1.0 1
#3 123a 2020-08-01 5.0 2
#4 123a 2020-09-01 1.0 2
#5 123a 2021-01-01 4.0 3
#6 456b 2020-02-01 0.0 4
#7 456b 2020-05-01 3.0 5
#8 456b 2020-06-01 1.0 5
#9 456b 2020-07-01 1.0 5
#10 456b 2020-08-01 1.0 5
#11 789c 2020-02-01 0.0 6
#12 789c 2020-03-01 1.0 6
#13 789c 2020-04-01 1.0 6
#14 789c 2020-05-01 1.0 6
#15 789c 2020-06-01 1.0 6
#16 789c 2020-07-01 1.0 6
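The loop above amounts to a cumulative count of coverage breaks: a row starts a new period whenever its gap is not exactly 1 month, and a running sum of those break flags yields the period id. A minimal pure-Python sketch of that idea (using itertools.accumulate on the gap values from the first eleven rows above):

```python
from itertools import accumulate

# Gap values in member/month order, taken from rows 0-10 of the example above.
gaps = [0.0, 1.0, 1.0, 5.0, 1.0, 4.0, 0.0, 3.0, 1.0, 1.0, 1.0]

# 1 marks the start of a new coverage period (gap != 1 month), 0 continues one.
breaks = [1 if g != 1 else 0 for g in gaps]

# The running sum of break flags is the unique coverage period id.
period_ids = list(accumulate(breaks))

print(period_ids)  # [1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5]
```

This matches the unique_coverage_period column above and is exactly the quantity a windowed cumulative sum computes in Spark.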
You can do a rolling sum over a window, as follows:
from pyspark.sql import functions as F, Window

result = scdf.withColumn(
    'flag',
    F.sum((F.col('gap') != 1).cast('int')).over(Window.orderBy('memid', 'month_elig'))
)
result.show()
+-----+-------------------+---+----+
|memid| month_elig|gap|flag|
+-----+-------------------+---+----+
| 123a|2020-01-01 00:00:00|0.0| 1|
| 123a|2020-02-01 00:00:00|1.0| 1|
| 123a|2020-03-01 00:00:00|1.0| 1|
| 123a|2020-08-01 00:00:00|5.0| 2|
| 123a|2020-09-01 00:00:00|1.0| 2|
| 123a|2021-01-01 00:00:00|4.0| 3|
| 456b|2020-02-01 00:00:00|0.0| 4|
| 456b|2020-05-01 00:00:00|3.0| 5|
| 456b|2020-06-01 00:00:00|1.0| 5|
| 456b|2020-07-01 00:00:00|1.0| 5|
| 456b|2020-08-01 00:00:00|1.0| 5|
| 789c|2020-02-01 00:00:00|0.0| 6|
| 789c|2020-03-01 00:00:00|1.0| 6|
| 789c|2020-04-01 00:00:00|1.0| 6|
| 789c|2020-05-01 00:00:00|1.0| 6|
| 789c|2020-06-01 00:00:00|1.0| 6|
| 789c|2020-07-01 00:00:00|1.0| 6|
+-----+-------------------+---+----+
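Note that `Window.orderBy(...)` without a `partitionBy` sorts all rows in a single partition, which is what makes the flags globally unique but can be slow at scale. If period numbers that restart per member are acceptable, the same cumulative-sum trick could presumably be run with `Window.partitionBy('memid').orderBy('month_elig')` instead. A pure-Python sketch of that per-member numbering (hypothetical sample rows, already sorted by memid and month):

```python
from itertools import accumulate, groupby

# (memid, gap) pairs, sorted by memid and month_elig.
rows = [
    ('123a', 0.0), ('123a', 1.0), ('123a', 1.0), ('123a', 5.0),
    ('456b', 0.0), ('456b', 3.0), ('456b', 1.0),
]

result = []
for memid, grp in groupby(rows, key=lambda r: r[0]):
    gaps = [g for _, g in grp]
    # Restarting the running sum per member scopes the period id to that member.
    for pid in accumulate(1 if g != 1 else 0 for g in gaps):
        result.append((memid, pid))

print(result)
# [('123a', 1), ('123a', 1), ('123a', 1), ('123a', 2),
#  ('456b', 1), ('456b', 2), ('456b', 2)]
```

A (memid, period_id) pair is then still unique overall, without a global sort.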
Since then, I have come up with an alternative approach to identifying unique coverage periods. While I find the accepted answer posted by @mck clearer and more straightforward, the approach below appears to perform faster on the actual, much larger data set of 84.6 million records.
from pyspark.sql import functions as F, Window

# Create a new DataFrame that retains only the coverage break months and then orders each month per member
w1 = Window().partitionBy('memid').orderBy(F.col('month_elig'))
scdf1 = scdf \
    .filter(F.col('gap') != 1) \
    .withColumn('rank', F.rank().over(w1)) \
    .select('memid', F.col('month_elig').alias('starter_month'), 'rank')

# Join the two DataFrames by memid and keep only the records where 'month_elig' >= 'starter_month'
scdf2 = scdf.join(scdf1, on='memid', how='inner') \
    .withColumn('starter', F.when(F.col('month_elig') == F.col('starter_month'), 1).otherwise(0)) \
    .filter(F.col('month_elig') >= F.col('starter_month'))

# If 'month_elig' == 'starter_month', keep that record; otherwise keep the latest 'starter_month' for each 'month_elig'
w2 = Window().partitionBy('memid', 'month_elig').orderBy(F.col('starter').desc(), F.col('rank').desc())
scdf2 = scdf2 \
    .withColumn('rank', F.rank().over(w2)) \
    .filter(F.col('rank') == 1).drop('rank') \
    .withColumn('flag', F.concat(F.col('memid'), F.lit('_'), F.trunc(F.col('starter_month'), 'month'))) \
    .select('memid', 'month_elig', 'gap', 'flag')

scdf2.show()
+-----+-------------------+---+---------------+
|memid| month_elig|gap| flag|
+-----+-------------------+---+---------------+
| 789c|2020-02-01 00:00:00|0.0|789c_2020-02-01|
| 789c|2020-03-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-04-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-05-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-06-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-07-01 00:00:00|1.0|789c_2020-02-01|
| 123a|2020-01-01 00:00:00|0.0|123a_2020-01-01|
| 123a|2020-02-01 00:00:00|1.0|123a_2020-01-01|
| 123a|2020-03-01 00:00:00|1.0|123a_2020-01-01|
| 123a|2020-08-01 00:00:00|5.0|123a_2020-08-01|
| 123a|2020-09-01 00:00:00|1.0|123a_2020-08-01|
| 123a|2021-01-01 00:00:00|4.0|123a_2021-01-01|
| 456b|2020-02-01 00:00:00|0.0|456b_2020-02-01|
| 456b|2020-05-01 00:00:00|3.0|456b_2020-05-01|
| 456b|2020-06-01 00:00:00|1.0|456b_2020-05-01|
| 456b|2020-07-01 00:00:00|1.0|456b_2020-05-01|
| 456b|2020-08-01 00:00:00|1.0|456b_2020-05-01|
+-----+-------------------+---+---------------+
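Either version of the flag can then be aggregated to describe each coverage period, e.g. with something like `scdf2.groupBy('flag').agg(F.min('month_elig'), F.max('month_elig'), F.count('*'))` (a sketch, not taken from the post). A pure-Python illustration of that aggregation over the flagged rows for member 123a:

```python
from collections import defaultdict

# (flag, month_elig) pairs for member 123a, taken from the output above.
flagged = [
    ('123a_2020-01-01', '2020-01-01'),
    ('123a_2020-01-01', '2020-02-01'),
    ('123a_2020-01-01', '2020-03-01'),
    ('123a_2020-08-01', '2020-08-01'),
    ('123a_2020-08-01', '2020-09-01'),
    ('123a_2021-01-01', '2021-01-01'),
]

# Group the months belonging to each coverage period.
periods = defaultdict(list)
for flag, month in flagged:
    periods[flag].append(month)

# Each period's start month, end month, and length in enrolled months.
summary = {f: (min(ms), max(ms), len(ms)) for f, ms in periods.items()}
print(summary)
# {'123a_2020-01-01': ('2020-01-01', '2020-03-01', 3),
#  '123a_2020-08-01': ('2020-08-01', '2020-09-01', 2),
#  '123a_2021-01-01': ('2021-01-01', '2021-01-01', 1)}
```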