Apache Spark: creating a custom counter in Spark based on dataframe conditions

Current dataset:

+---+-----+-----+-----+----+
| ID|Event|Index|start| end|
+---+-----+-----+-----+----+
|  1|  run|    0|start|null|
|  1|  run|    1| null|null|
|  1|  run|    2| null|null|
|  1| swim|    3| null| end|
|  1|  run|    4|start|null|
|  1| swim|    5| null|null|
|  1| swim|    6| null| end|
|  1|  run|    7|start|null|
|  1|  run|    8| null|null|
|  1|  run|    9| null|null|
|  1| swim|   10| null| end|
|  1|  run|   11|start|null|
|  1|  run|   12| null|null|
|  1|  run|   13| null| end|
|  2|  run|   14|start|null|
|  2|  run|   15| null|null|
|  2|  run|   16| null|null|
|  2| swim|   17| null| end|
|  2|  run|   18|start|null|
|  2| swim|   19| null|null|
|  2| swim|   20| null|null|
|  2| swim|   21| null|null|
|  2| swim|   22| null| end|
|  2|  run|   23|start|null|
|  2|  run|   24| null|null|
|  2|  run|   25| null| end|
|  3|  run|   26|start|null|
|  3|  run|   27| null|null|
|  3| swim|   28| null|null|
+---+-----+-----+-----+----+

What I am looking for:

+---+-----+-----+-----+----+-------+
| ID|Event|Index|start| end|EventID|
+---+-----+-----+-----+----+-------+
|  1|  run|    0|start|null|      1|
|  1|  run|    1| null|null|      1|
|  1|  run|    2| null|null|      1|
|  1| swim|    3| null| end|      1|
|  1|  run|    4|start|null|      2|
|  1| swim|    5| null|null|      2|
|  1| swim|    6| null| end|      2|
|  1|  run|    7|start|null|      3|
|  1|  run|    8| null|null|      3|
|  1|  run|    9| null|null|      3|
|  1| swim|   10| null| end|      3|
|  1|  run|   11|start|null|      4|
|  1|  run|   12| null|null|      4|
|  1|  run|   13| null| end|      4|
|  2|  run|   14|start|null|      1|
|  2|  run|   15| null|null|      1|
|  2|  run|   16| null|null|      1|
|  2| swim|   17| null| end|      1|
|  2|  run|   18|start|null|      2|
|  2| swim|   19| null|null|      2|
|  2| swim|   20| null|null|      2|
|  2| swim|   21| null|null|      2|
|  2| swim|   22| null| end|      2|
|  2|  run|   23|start|null|      3|
|  2|  run|   24| null|null|      3|
|  2|  run|   25| null| end|      3|
|  3|  run|   26|start|null|      1|
|  3|  run|   27| null|null|      1|
|  3| swim|   28| null|null|      1|
+---+-----+-----+-----+----+-------+

Tags: apache-spark, pyspark, apache-spark-sql, user-defined-functions

I am trying to create the EventID column above. Is there a way to create a counter inside a UDF that updates based on column conditions? Note that I am not sure a UDF is the best approach here. Here is my logic so far:
- Start counting when a 'start' value is seen
- Stop counting when an 'end' value is seen
- Reset the counter to 1 each time a new ID is seen

Thanks for any help. Here is the raw code that generates the current dataframe:
# Current Dataset
data = [
(1, "run", 0, 'start', None),
(1, "run", 1, None, None),
(1, "run", 2, None, None),
(1, "swim", 3, None, 'end'),
(1, "run", 4, 'start',None),
(1, "swim", 5, None, None),
(1, "swim", 6, None, 'end'),
(1, "run",7, 'start', None),
(1, "run",8, None, None),
(1, "run",9, None, None),
(1, "swim",10, None, 'end'),
(1, "run",11, 'start', None),
(1, "run",12, None, None),
(1, "run",13, None, 'end'),
(2, "run",14, 'start', None),
(2, "run",15, None, None),
(2, "run",16, None, None),
(2, "swim",17, None, 'end'),
(2, "run",18, 'start', None),
(2, "swim",19, None, None),
(2, "swim",20, None, None),
(2, "swim",21, None, None),
(2, "swim",22, None, 'end'),
(2, "run",23, 'start', None),
(2, "run",24, None, None),
(2, "run",25, None, 'end'),
(3, "run",26, 'start', None),
(3, "run",27, None, None),
(3, "swim",28, None, None)
]
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField('ID', IntegerType(), True),
    StructField('Event', StringType(), True),
    StructField('Index', IntegerType(), True),
    StructField('start', StringType(), True),
    StructField('end', StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.show(30)
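For reference, the three counting rules can be sketched in plain Python (a hypothetical helper, not Spark code) to make the expected EventID values concrete:

```python
def assign_event_ids(rows):
    """rows: list of (ID, Event, Index, start, end) tuples, ordered by Index.
    Returns a parallel list of EventID values following the rules above."""
    event_ids = []
    counter = 0
    current_id = None
    for row_id, _event, _index, start, _end in rows:
        if row_id != current_id:      # new ID: reset the counter
            current_id = row_id
            counter = 0
        if start == 'start':          # a 'start' marker begins a new event
            counter += 1
        event_ids.append(counter)
    return event_ids

sample = [
    (1, "run", 0, 'start', None),
    (1, "swim", 3, None, 'end'),
    (1, "run", 4, 'start', None),
    (2, "run", 14, 'start', None),
]
print(assign_event_ids(sample))  # [1, 1, 2, 1]
```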
You can use a window function:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('Index').rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('EventId', F.sum(F.when(F.col('start') == 'start', 1).otherwise(0)).over(w)) \
  .orderBy('ID', 'Index').show(100)
which results in:
+---+-----+-----+-----+----+-------+
| ID|Event|Index|start| end|EventId|
+---+-----+-----+-----+----+-------+
| 1| run| 0|start|null| 1|
| 1| run| 1| null|null| 1|
| 1| run| 2| null|null| 1|
| 1| swim| 3| null| end| 1|
| 1| run| 4|start|null| 2|
| 1| swim| 5| null|null| 2|
| 1| swim| 6| null| end| 2|
| 1| run| 7|start|null| 3|
| 1| run| 8| null|null| 3|
| 1| run| 9| null|null| 3|
| 1| swim| 10| null| end| 3|
| 1| run| 11|start|null| 4|
| 1| run| 12| null|null| 4|
| 1| run| 13| null| end| 4|
| 2| run| 14|start|null| 1|
| 2| run| 15| null|null| 1|
| 2| run| 16| null|null| 1|
| 2| swim| 17| null| end| 1|
| 2| run| 18|start|null| 2|
| 2| swim| 19| null|null| 2|
| 2| swim| 20| null|null| 2|
| 2| swim| 21| null|null| 2|
| 2| swim| 22| null| end| 2|
| 2| run| 23|start|null| 3|
| 2| run| 24| null|null| 3|
| 2| run| 25| null| end| 3|
| 3| run| 26|start|null| 1|
| 3| run| 27| null|null| 1|
| 3| swim| 28| null|null| 1|
+---+-----+-----+-----+----+-------+
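The cumulative sum works because the window frame runs from the first row of each ID partition up to the current row, so every row's EventId is simply the number of 'start' markers seen so far for that ID. A plain-Python sketch of that running count for one partition (hypothetical helper, not part of the Spark answer):

```python
# Emulate F.sum(F.when(start == 'start', 1).otherwise(0)).over(w)
# within a single partition: a running count of 'start' markers.
def running_start_count(starts):
    """starts: ordered list of the 'start' column values for one ID."""
    out, total = [], 0
    for s in starts:
        total += 1 if s == 'start' else 0
        out.append(total)
    return out

# The 'start' column for ID = 1 from the dataset above:
starts_id1 = ['start', None, None, None, 'start', None, None,
              'start', None, None, None, 'start', None, None]
print(running_start_count(starts_id1))
# [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
```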
Alternatively, you can compute a dense rank based on the most recent start index:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'laststart',
F.last(F.when(F.col('start') == 'start', F.col('Index')), True).over(Window.partitionBy('ID').orderBy('Index'))
).withColumn(
'EventID',
F.dense_rank().over(Window.partitionBy('ID').orderBy('laststart'))
)
df2.show(999)
+---+-----+-----+-----+----+---------+-------+
| ID|Event|Index|start| end|laststart|EventID|
+---+-----+-----+-----+----+---------+-------+
| 1| run| 0|start|null| 0| 1|
| 1| run| 1| null|null| 0| 1|
| 1| run| 2| null|null| 0| 1|
| 1| swim| 3| null| end| 0| 1|
| 1| run| 4|start|null| 4| 2|
| 1| swim| 5| null|null| 4| 2|
| 1| swim| 6| null| end| 4| 2|
| 1| run| 7|start|null| 7| 3|
| 1| run| 8| null|null| 7| 3|
| 1| run| 9| null|null| 7| 3|
| 1| swim| 10| null| end| 7| 3|
| 1| run| 11|start|null| 11| 4|
| 1| run| 12| null|null| 11| 4|
| 1| run| 13| null| end| 11| 4|
| 2| run| 14|start|null| 14| 1|
| 2| run| 15| null|null| 14| 1|
| 2| run| 16| null|null| 14| 1|
| 2| swim| 17| null| end| 14| 1|
| 2| run| 18|start|null| 18| 2|
| 2| swim| 19| null|null| 18| 2|
| 2| swim| 20| null|null| 18| 2|
| 2| swim| 21| null|null| 18| 2|
| 2| swim| 22| null| end| 18| 2|
| 2| run| 23|start|null| 23| 3|
| 2| run| 24| null|null| 23| 3|
| 2| run| 25| null| end| 23| 3|
| 3| run| 26|start|null| 26| 1|
| 3| run| 27| null|null| 26| 1|
| 3| swim| 28| null|null| 26| 1|
+---+-----+-----+-----+----+---------+-------+
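The two steps of this approach (carry the last non-null start Index forward, then dense-rank the distinct values) can be sketched in plain Python for one ID partition. This sketch assumes, as in the sample data, that the first row of each ID carries a 'start' marker so laststart is never null:

```python
# Emulate F.last(..., True) followed by F.dense_rank() over 'laststart'
# within a single partition (plain-Python sketch, not Spark code).
def dense_rank_by_last_start(rows):
    """rows: ordered list of (Index, start) pairs for one ID."""
    last_start = None
    last_starts = []
    for index, start in rows:
        if start == 'start':
            last_start = index      # most recent Index where start == 'start'
        last_starts.append(last_start)
    # dense_rank: rank the distinct laststart values in ascending order
    ranking = {v: r for r, v in enumerate(sorted(set(last_starts)), start=1)}
    return [ranking[v] for v in last_starts]

rows_id1 = [(0, 'start'), (1, None), (2, None), (3, None),
            (4, 'start'), (5, None), (6, None)]
print(dense_rank_by_last_start(rows_id1))  # [1, 1, 1, 1, 2, 2, 2]
```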
In your current setup you don't need an extra condition for ending the count, because a 'start' always appears in the next row. Is that a correct assumption? If not, can you provide a rule for what should happen in that case?