Apache Spark: create a custom counter in Spark based on DataFrame conditions

Current dataset:

+---+-----+-----+-----+----+
| ID|Event|Index|start| end|
+---+-----+-----+-----+----+
|  1|  run|    0|start|null|
|  1|  run|    1| null|null|
|  1|  run|    2| null|null|
|  1| swim|    3| null| end|
|  1|  run|    4|start|null|
|  1| swim|    5| null|null|
|  1| swim|    6| null| end|
|  1|  run|    7|start|null|
|  1|  run|    8| null|null|
|  1|  run|    9| null|null|
|  1| swim|   10| null| end|
|  1|  run|   11|start|null|
|  1|  run|   12| null|null|
|  1|  run|   13| null| end|
|  2|  run|   14|start|null|
|  2|  run|   15| null|null|
|  2|  run|   16| null|null|
|  2| swim|   17| null| end|
|  2|  run|   18|start|null|
|  2| swim|   19| null|null|
|  2| swim|   20| null|null|
|  2| swim|   21| null|null|
|  2| swim|   22| null| end|
|  2|  run|   23|start|null|
|  2|  run|   24| null|null|
|  2|  run|   25| null| end|
|  3|  run|   26|start|null|
|  3|  run|   27| null|null|
|  3| swim|   28| null|null|
+---+-----+-----+-----+----+

What I am looking for:

+---+-----+-----+-----+----+-------+
| ID|Event|Index|start| end|EventID|
+---+-----+-----+-----+----+-------+
|  1|  run|    0|start|null|      1|
|  1|  run|    1| null|null|      1|
|  1|  run|    2| null|null|      1|
|  1| swim|    3| null| end|      1|
|  1|  run|    4|start|null|      2|
|  1| swim|    5| null|null|      2|
|  1| swim|    6| null| end|      2|
|  1|  run|    7|start|null|      3|
|  1|  run|    8| null|null|      3|
|  1|  run|    9| null|null|      3|
|  1| swim|   10| null| end|      3|
|  1|  run|   11|start|null|      4|
|  1|  run|   12| null|null|      4|
|  1|  run|   13| null| end|      4|
|  2|  run|   14|start|null|      1|
|  2|  run|   15| null|null|      1|
|  2|  run|   16| null|null|      1|
|  2| swim|   17| null| end|      1|
|  2|  run|   18|start|null|      2|
|  2| swim|   19| null|null|      2|
|  2| swim|   20| null|null|      2|
|  2| swim|   21| null|null|      2|
|  2| swim|   22| null| end|      2|
|  2|  run|   23|start|null|      3|
|  2|  run|   24| null|null|      3|
|  2|  run|   25| null| end|      3|
|  3|  run|   26|start|null|      1|
|  3|  run|   27| null|null|      1|
|  3| swim|   28| null|null|      1|
+---+-----+-----+-----+----+-------+


I am trying to create the EventID column shown above. Is there a way to create a counter inside a UDF that updates based on a column condition? Note that I am not sure a UDF is the best approach here.

Here is my current thinking for the logic:

  • Start counting when a 'start' value is seen
  • Stop counting when an 'end' value is seen
  • Reset the counter to 1 every time a new ID is seen
Thanks everyone for the help.

Here is the original code that generates the current DataFrame:

# Current Dataset

data = [
       (1, "run", 0, 'start', None),
       (1, "run", 1,  None,   None),
       (1, "run", 2,  None,   None),
       (1, "swim", 3, None,   'end'),
       (1, "run",  4, 'start',None),
       (1, "swim", 5, None,   None),
       (1, "swim", 6, None,   'end'),
       (1, "run",7, 'start',   None),
       (1, "run",8, None,   None),
       (1, "run",9, None,   None),
       (1, "swim",10, None,   'end'),
       (1, "run",11, 'start',   None),
       (1, "run",12, None,   None),
       (1, "run",13, None,   'end'),
       (2, "run",14, 'start',   None),
       (2, "run",15, None,   None),
       (2, "run",16, None,   None),
       (2, "swim",17, None,   'end'),
       (2, "run",18, 'start',   None),
       (2, "swim",19, None,   None),
       (2, "swim",20, None,   None),
       (2, "swim",21, None,   None),
       (2, "swim",22, None,   'end'),
       (2, "run",23, 'start',   None),
       (2, "run",24, None,   None),
       (2, "run",25, None,   'end'),
       (3, "run",26, 'start',   None),
       (3, "run",27, None,   None),
       (3, "swim",28, None,   None)
        ]

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
  StructField('ID', IntegerType(), True),
  StructField('Event', StringType(), True),
  StructField('Index', IntegerType(), True),
  StructField('start', StringType(), True),
  StructField('end', StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show(30)
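
For reference, the row-by-row counter described in the question could also be written as a grouped-map pandas UDF. This is only a minimal sketch, assuming Spark 3.x with pandas and PyArrow available; it reuses the schema defined above, and add_event_id and out_schema are illustrative names, not part of the original post:

import pandas as pd

def add_event_id(pdf: pd.DataFrame) -> pd.DataFrame:
    # Process one ID group at a time: every 'start' marker opens a new event,
    # so a simple running counter reproduces the EventID column.
    pdf = pdf.sort_values('Index')
    counter = 0
    event_ids = []
    for value in pdf['start']:
        if value == 'start':
            counter += 1
        event_ids.append(counter)
    pdf['EventID'] = event_ids
    return pdf

# Output schema = input schema plus the new EventID column
out_schema = StructType(schema.fields + [StructField('EventID', IntegerType(), True)])

df.groupBy('ID').applyInPandas(add_event_id, schema=out_schema) \
  .orderBy('ID', 'Index').show(30)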

You can use a window function: a running sum of the 'start' flags within each ID gives the counter directly.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('Index').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('EventId', F.sum(F.when(F.col('start') == 'start', 1).otherwise(0)).over(w)) \
  .orderBy('ID', 'Index').show(100)

which results in:

+---+-----+-----+-----+----+-------+
| ID|Event|Index|start| end|EventId|
+---+-----+-----+-----+----+-------+
|  1|  run|    0|start|null|      1|
|  1|  run|    1| null|null|      1|
|  1|  run|    2| null|null|      1|
|  1| swim|    3| null| end|      1|
|  1|  run|    4|start|null|      2|
|  1| swim|    5| null|null|      2|
|  1| swim|    6| null| end|      2|
|  1|  run|    7|start|null|      3|
|  1|  run|    8| null|null|      3|
|  1|  run|    9| null|null|      3|
|  1| swim|   10| null| end|      3|
|  1|  run|   11|start|null|      4|
|  1|  run|   12| null|null|      4|
|  1|  run|   13| null| end|      4|
|  2|  run|   14|start|null|      1|
|  2|  run|   15| null|null|      1|
|  2|  run|   16| null|null|      1|
|  2| swim|   17| null| end|      1|
|  2|  run|   18|start|null|      2|
|  2| swim|   19| null|null|      2|
|  2| swim|   20| null|null|      2|
|  2| swim|   21| null|null|      2|
|  2| swim|   22| null| end|      2|
|  2|  run|   23|start|null|      3|
|  2|  run|   24| null|null|      3|
|  2|  run|   25| null| end|      3|
|  3|  run|   26|start|null|      1|
|  3|  run|   27| null|null|      1|
|  3| swim|   28| null|null|      1|
+---+-----+-----+-----+----+-------+
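
The same running sum can also be expressed as a single SQL window expression via F.expr, reusing the F import from the snippet above. This is just an equivalent alternative form, not something required by the approach:

df.withColumn(
    'EventId',
    F.expr("sum(case when start = 'start' then 1 else 0 end) "
           "over (partition by ID order by Index "
           "rows between unbounded preceding and current row)")
).orderBy('ID', 'Index').show(100)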

You can calculate a dense rank based on the most recent 'start' index:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'laststart', 
    F.last(F.when(F.col('start') == 'start', F.col('Index')), True).over(Window.partitionBy('ID').orderBy('Index'))
).withColumn(
    'EventID', 
    F.dense_rank().over(Window.partitionBy('ID').orderBy('laststart'))
)

df2.show(999)
+---+-----+-----+-----+----+---------+-------+
| ID|Event|Index|start| end|laststart|EventID|
+---+-----+-----+-----+----+---------+-------+
|  1|  run|    0|start|null|        0|      1|
|  1|  run|    1| null|null|        0|      1|
|  1|  run|    2| null|null|        0|      1|
|  1| swim|    3| null| end|        0|      1|
|  1|  run|    4|start|null|        4|      2|
|  1| swim|    5| null|null|        4|      2|
|  1| swim|    6| null| end|        4|      2|
|  1|  run|    7|start|null|        7|      3|
|  1|  run|    8| null|null|        7|      3|
|  1|  run|    9| null|null|        7|      3|
|  1| swim|   10| null| end|        7|      3|
|  1|  run|   11|start|null|       11|      4|
|  1|  run|   12| null|null|       11|      4|
|  1|  run|   13| null| end|       11|      4|
|  2|  run|   14|start|null|       14|      1|
|  2|  run|   15| null|null|       14|      1|
|  2|  run|   16| null|null|       14|      1|
|  2| swim|   17| null| end|       14|      1|
|  2|  run|   18|start|null|       18|      2|
|  2| swim|   19| null|null|       18|      2|
|  2| swim|   20| null|null|       18|      2|
|  2| swim|   21| null|null|       18|      2|
|  2| swim|   22| null| end|       18|      2|
|  2|  run|   23|start|null|       23|      3|
|  2|  run|   24| null|null|       23|      3|
|  2|  run|   25| null| end|       23|      3|
|  3|  run|   26|start|null|       26|      1|
|  3|  run|   27| null|null|       26|      1|
|  3| swim|   28| null|null|       26|      1|
+---+-----+-----+-----+----+---------+-------+
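
If the helper column is not needed downstream, it can be dropped once EventID has been computed, for example:

df2.drop('laststart').orderBy('ID', 'Index').show(999)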

In your current setup you don't need an extra condition to end the count, because the next row always contains a 'start'. Is that a correct assumption? If not, can you provide a rule for what should happen in that case?