Finding continuous month-over-month enrollment periods in PySpark


I am trying to use a Spark DataFrame of health-plan member IDs and enrollment months to identify "continuous" coverage periods, i.e., stretches of consecutive months in which a member is enrolled.

Below is a sample of the data I am working with in PySpark (sc is the SparkSession).
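The snippet that builds this sample is not reproduced here; below is a minimal reconstruction of an equivalent scdf, with the schema and values inferred from the printed frames that follow:

from pyspark.sql import SparkSession, functions as F

sc = SparkSession.builder.getOrCreate()

# Reconstructed sample: one row per member-month; gap is the number of
# months since that member's previous row (0.0 marks a member's first month)
scdf = sc.createDataFrame(
    [('123a', '2020-01-01', 0.0), ('123a', '2020-02-01', 1.0),
     ('123a', '2020-03-01', 1.0), ('123a', '2020-08-01', 5.0),
     ('123a', '2020-09-01', 1.0), ('123a', '2021-01-01', 4.0),
     ('456b', '2020-02-01', 0.0), ('456b', '2020-05-01', 3.0),
     ('456b', '2020-06-01', 1.0), ('456b', '2020-07-01', 1.0),
     ('456b', '2020-08-01', 1.0), ('789c', '2020-02-01', 0.0),
     ('789c', '2020-03-01', 1.0), ('789c', '2020-04-01', 1.0),
     ('789c', '2020-05-01', 1.0), ('789c', '2020-06-01', 1.0),
     ('789c', '2020-07-01', 1.0)],
    ['memid', 'month_elig', 'gap'],
).withColumn('month_elig', F.to_timestamp('month_elig'))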

If I were doing this exercise in Pandas, I would use the code below to create the unique_coverage_period field. However, given the size of the data I am working with, the solution needs to be in Spark, and from my research so far (), an iterator approach like this isn't really what Spark is designed for.

# Pandas approach: bump a counter every time a row starts a new period
# (gap != 1), so each continuous stretch gets its own ID
a = 0
b = []
for i in df.gap.tolist():
    if i != 1:
        a += 1
    b.append(a)

df['unique_coverage_period'] = b

print(df)

#   memid month_elig  gap  unique_coverage_period
#0   123a 2020-01-01  0.0                       1
#1   123a 2020-02-01  1.0                       1
#2   123a 2020-03-01  1.0                       1
#3   123a 2020-08-01  5.0                       2
#4   123a 2020-09-01  1.0                       2
#5   123a 2021-01-01  4.0                       3
#6   456b 2020-02-01  0.0                       4
#7   456b 2020-05-01  3.0                       5
#8   456b 2020-06-01  1.0                       5
#9   456b 2020-07-01  1.0                       5
#10  456b 2020-08-01  1.0                       5
#11  789c 2020-02-01  0.0                       6
#12  789c 2020-03-01  1.0                       6
#13  789c 2020-04-01  1.0                       6
#14  789c 2020-05-01  1.0                       6
#15  789c 2020-06-01  1.0                       6
#16  789c 2020-07-01  1.0                       6
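
As an aside, the same column can be computed in pandas without the explicit loop: since every gap != 1 row starts a new period, a cumulative sum of that flag is the vectorized equivalent, and it is the same idea the windowed answer below builds on. A sketch, given the same pandas df:

# Each gap != 1 row starts a new period; the running count of such
# rows is exactly the unique_coverage_period numbering above
df['unique_coverage_period'] = (df.gap != 1).cumsum()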


You can do a rolling sum over a window, like this:

from pyspark.sql import functions as F, Window

# Every row with gap != 1 starts a new coverage period, so a running sum
# of that indicator over a global ordering numbers the periods sequentially
result = scdf.withColumn(
    'flag',
    F.sum((F.col('gap') != 1).cast('int')).over(Window.orderBy('memid', 'month_elig'))
)

result.show()
+-----+-------------------+---+----+
|memid|         month_elig|gap|flag|
+-----+-------------------+---+----+
| 123a|2020-01-01 00:00:00|0.0|   1|
| 123a|2020-02-01 00:00:00|1.0|   1|
| 123a|2020-03-01 00:00:00|1.0|   1|
| 123a|2020-08-01 00:00:00|5.0|   2|
| 123a|2020-09-01 00:00:00|1.0|   2|
| 123a|2021-01-01 00:00:00|4.0|   3|
| 456b|2020-02-01 00:00:00|0.0|   4|
| 456b|2020-05-01 00:00:00|3.0|   5|
| 456b|2020-06-01 00:00:00|1.0|   5|
| 456b|2020-07-01 00:00:00|1.0|   5|
| 456b|2020-08-01 00:00:00|1.0|   5|
| 789c|2020-02-01 00:00:00|0.0|   6|
| 789c|2020-03-01 00:00:00|1.0|   6|
| 789c|2020-04-01 00:00:00|1.0|   6|
| 789c|2020-05-01 00:00:00|1.0|   6|
| 789c|2020-06-01 00:00:00|1.0|   6|
| 789c|2020-07-01 00:00:00|1.0|   6|
+-----+-------------------+---+----+
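
One caveat about this answer: Window.orderBy with no partitionBy forces Spark to move all rows into a single partition (it logs a warning to that effect), which can become a bottleneck at tens of millions of records. A possible variant, not part of the original answer, that partitions by member and then prefixes memid so period IDs stay globally unique (the flag values differ, but the groupings are the same):

from pyspark.sql import functions as F, Window

w = Window.partitionBy('memid').orderBy('month_elig')

# Per-member period number, then prefix with memid so the period IDs
# remain unique across members
result = scdf.withColumn(
    'period', F.sum((F.col('gap') != 1).cast('int')).over(w)
).withColumn(
    'flag', F.concat_ws('_', F.col('memid'), F.col('period').cast('string'))
)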

Since posting this, I have come up with an alternative way to determine the unique coverage periods. While I find the accepted answer posted by @mck cleaner and more straightforward, the approach below appeared to run faster on the actual, much larger dataset of 84.6 million records.

# Create a new DataFrame that retains only the coverage-break months and
# then orders those break months within each member
w1 = Window().partitionBy('memid').orderBy(F.col('month_elig'))

scdf1 = scdf \
  .filter(F.col('gap') != 1) \
  .withColumn('rank', F.rank().over(w1)) \
  .select('memid', F.col('month_elig').alias('starter_month'), 'rank')


# Join the two DataFrames on memid and keep only the records where
# 'month_elig' is >= the 'starter_month'
scdf2 = scdf.join(scdf1, on='memid', how='inner') \
  .withColumn('starter', F.when(F.col('month_elig') == F.col('starter_month'), 1)
                          .otherwise(0)) \
  .filter(F.col('month_elig') >= F.col('starter_month'))


# If 'month_elig' == 'starter_month', keep that record; otherwise keep the
# latest 'starter_month' for each 'month_elig' record
w2 = Window().partitionBy(['memid', 'month_elig']).orderBy(F.col('starter').desc(), F.col('rank').desc())

scdf2 = scdf2 \
  .withColumn('rank', F.rank().over(w2)) \
  .filter(F.col('rank') == 1).drop('rank') \
  .withColumn('flag', F.concat(F.col('memid'), F.lit('_'), F.trunc(F.col('starter_month'), 'month'))) \
  .select('memid', 'month_elig', 'gap', 'flag')

scdf2.show()
+-----+-------------------+---+---------------+
|memid|         month_elig|gap|           flag|
+-----+-------------------+---+---------------+
| 789c|2020-02-01 00:00:00|0.0|789c_2020-02-01|
| 789c|2020-03-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-04-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-05-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-06-01 00:00:00|1.0|789c_2020-02-01|
| 789c|2020-07-01 00:00:00|1.0|789c_2020-02-01|
| 123a|2020-01-01 00:00:00|0.0|123a_2020-01-01|
| 123a|2020-02-01 00:00:00|1.0|123a_2020-01-01|
| 123a|2020-03-01 00:00:00|1.0|123a_2020-01-01|
| 123a|2020-08-01 00:00:00|5.0|123a_2020-08-01|
| 123a|2020-09-01 00:00:00|1.0|123a_2020-08-01|
| 123a|2021-01-01 00:00:00|4.0|123a_2021-01-01|
| 456b|2020-02-01 00:00:00|0.0|456b_2020-02-01|
| 456b|2020-05-01 00:00:00|3.0|456b_2020-05-01|
| 456b|2020-06-01 00:00:00|1.0|456b_2020-05-01|
| 456b|2020-07-01 00:00:00|1.0|456b_2020-05-01|
| 456b|2020-08-01 00:00:00|1.0|456b_2020-05-01|
+-----+-------------------+---+---------------+
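
For what it's worth, the starter_month lookup above can also be expressed without the self-join by carrying the most recent break month forward with last(..., ignorenulls=True) over a per-member window. A sketch of that equivalent formulation, assuming the same scdf (this is not the approach that was benchmarked above):

from pyspark.sql import functions as F, Window

w = Window.partitionBy('memid').orderBy('month_elig')

# The default window frame runs from the start of the partition to the
# current row, so last(..., ignorenulls=True) picks up the most recent
# month where gap != 1, i.e. the start of the current coverage period
result = scdf.withColumn(
    'starter_month',
    F.last(F.when(F.col('gap') != 1, F.col('month_elig')), ignorenulls=True).over(w)
).withColumn(
    'flag', F.concat_ws('_', F.col('memid'), F.date_format('starter_month', 'yyyy-MM-dd'))
)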