Python data grouping problem, but based on a "window"


All,

I have a dataset defined as follows:

eno|date|attendance
1|01-Jan-2010|P
1|02-Jan-2010|P
1|03-Jan-2010|A
1|04-Jan-2010|P
1|05-Jan-2010|P
2|01-Jan-2010|P
2|02-Jan-2010|P
2|03-Jan-2010|P
2|04-Jan-2010|A
2|05-Jan-2010|P
For each employee, the requirement is to create an "interval group" that essentially groups attendance values in chronological order: consecutive identical attendance values fall into the same group until a new value appears. The expected output is therefore:

eno|date|attendance|attendanceGroup
1|01-Jan-2010|P|1
1|02-Jan-2010|P|1
1|03-Jan-2010|A|2
1|04-Jan-2010|P|3
1|05-Jan-2010|P|3
2|01-Jan-2010|P|1
2|02-Jan-2010|P|1
2|03-Jan-2010|P|1
2|04-Jan-2010|A|2
2|05-Jan-2010|P|3
So far, all I have been able to do is fetch the previous row's attendance value, but I have no idea where to go from here… Thanks in advance.

from datetime import datetime, timedelta
from pyspark.sql import Row, Window
from pyspark.sql.functions import lag

EmployeeAttendance = Row("eno", "date", "attendance")
EmpAttRowList = [EmployeeAttendance("1", datetime.now().date() - timedelta(days=100), "Y"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=99), "Y"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=98), "N"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=97), "Y"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=96), "Y"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=95), "N"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=94), "Y"),
                 EmployeeAttendance("1", datetime.now().date() - timedelta(days=93), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=100), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=99), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=98), "N"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=97), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=96), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=95), "N"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=94), "N"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=93), "N"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=92), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=91), "Y"),
                 EmployeeAttendance("2", datetime.now().date() - timedelta(days=90), "N"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=97), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=96), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=95), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=94), "N"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=93), "N"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=92), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=91), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=90), "Y"),
                 EmployeeAttendance("3", datetime.now().date() - timedelta(days=89), "Y")
                ]

df = spark.createDataFrame(EmpAttRowList)
window = Window.partitionBy(df['eno']).orderBy("date")
previousrowattendance = lag(df["attendance"]).over(window)

Assuming you have already created the DataFrame with the code above, you can use the following to compute attendanceGroup. Let me know whether it works.

import pyspark.sql.functions as F
from pyspark.sql import Window

winSpec = Window.partitionBy('eno').orderBy('date')
df_unique = df.withColumn('prevAttendance', F.lag('attendance').over(winSpec))
df_unique = df_unique.filter((df_unique.attendance != df_unique.prevAttendance) | F.col('prevAttendance').isNull())
df_unique = df_unique.withColumn('attendanceGroup', F.row_number().over(winSpec))
df_unique = df_unique.withColumnRenamed('eno', 'eno_t').withColumnRenamed('date', 'date_t').drop('attendance').drop('prevAttendance')
df = df.join(df_unique, (df.eno == df_unique.eno_t) & (df.date == df_unique.date_t), 'left').drop('eno_t').drop('date_t')
df = df.withColumn('attendanceGroup', F.last('attendanceGroup', ignorenulls = True).over(winSpec))
df.orderBy('eno', 'date').show(10, False)

+---+----------+----------+---------------+
|eno|date      |attendance|attendanceGroup|
+---+----------+----------+---------------+
|1  |2019-08-16|Y         |1              |
|1  |2019-08-17|Y         |1              |
|1  |2019-08-18|N         |2              |
|1  |2019-08-19|Y         |3              |
|1  |2019-08-20|Y         |3              |
|1  |2019-08-21|N         |4              |
|1  |2019-08-22|Y         |5              |
|1  |2019-08-23|Y         |5              |
|2  |2019-08-16|Y         |1              |
|2  |2019-08-17|Y         |1              |
+---+----------+----------+---------------+
only showing top 10 rows
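The key step in the answer above is `F.last('attendanceGroup', ignorenulls=True)`, which forward-fills each employee's group number from the last change point. As a sanity check, that fill semantics can be modelled in plain Python (no Spark needed; the helper name is illustrative, not part of any API):

```python
def forward_fill(values):
    """Mimic F.last(col, ignorenulls=True) over an ordered window:
    each None is replaced by the most recent non-None value seen so far."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v  # remember the latest non-null group number
        filled.append(last_seen)
    return filled

# Group numbers exist only on rows where attendance changed; fill the rest:
print(forward_fill([1, None, 2, 3, None]))  # [1, 1, 2, 3, 3]
```

This is why the join in the answer only needs to attach `row_number` at the change points: the window fill propagates it to the rows in between.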

You can try the following:

  • Create a
    grp
    flag with the condition
    attendance != lag(attendance)
    so that you can sum over the flag

  • Create a new window partitioned by the original id
    eno
    and the newly created
    grp
    flag column, and apply a
    sum
    , essentially adding 1 so the counter starts at 1

  • Output

    +---+----------+----------+-----+
    |eno|      date|attendance|group|
    +---+----------+----------+-----+
    |  1|2019-08-17|         Y|    1|
    |  1|2019-08-18|         Y|    1|
    |  1|2019-08-19|         N|    2|
    |  1|2019-08-20|         Y|    3|
    |  1|2019-08-21|         Y|    1|
    |  1|2019-08-22|         N|    4|
    |  1|2019-08-23|         Y|    5|
    |  1|2019-08-24|         Y|    1|
    |  2|2019-08-17|         Y|    1|
    |  2|2019-08-18|         Y|    1|
    |  2|2019-08-19|         N|    2|
    |  2|2019-08-20|         Y|    3|
    |  2|2019-08-21|         Y|    1|
    |  2|2019-08-22|         N|    4|
    |  2|2019-08-23|         N|    1|
    |  2|2019-08-24|         N|    1|
    |  2|2019-08-25|         Y|    5|
    |  2|2019-08-26|         Y|    1|
    |  2|2019-08-27|         N|    6|
    |  3|2019-08-20|         Y|    1|
    +---+----------+----------+-----+
    only showing top 20 rows
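The flag-and-sum idea in the bullets above can be checked with a plain-Python model (no Spark; the function name is illustrative): a running counter that increments whenever the attendance value differs from the previous row's value reproduces the grouping the question asks for:

```python
def attendance_groups(values):
    """Assign a group number that increments each time the value changes.

    Equivalent to: flag = (attendance != lag(attendance)), then a running
    sum of the flags plus 1, within one employee partition ordered by date.
    """
    groups = []
    counter = 0
    previous = object()  # sentinel: never equal to a real attendance value
    for v in values:
        if v != previous:
            counter += 1  # a new run starts, so open a new group
        groups.append(counter)
        previous = v
    return groups

# Employee 1 from the question: P P A P P -> 1 1 2 3 3
print(attendance_groups(["P", "P", "A", "P", "P"]))  # [1, 1, 2, 3, 3]
```

Note that a per-partition running sum of the change flags is what makes the counter monotonic; partitioning by the flag itself (as the output table above suggests happened) resets the counter and produces the stray 1s that the comment below points out.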
    

    I think the above solution needs improvement: the record (1 | 2019-08-21 | Y) should get group number 3, not 1.

    Thanks a lot for your help. This
    df = df.withColumn('attendanceGroup', F.last('attendanceGroup', ignorenulls=True).over(winSpec))
    is really neat.