Python: increment a counter column when a specific value is encountered in another column
I have a PySpark DataFrame with a column new_session whose values are 1 or 0. I want to create another column (session_id) that holds a counter which is incremented each time new_session is 1. Sample input:
df_tes = spark_session.createDataFrame([
(1, "item_1"),
(1, "item_2"),
(0, "item_3"),
(0, "item_4"),
(1, "item_1")], ["new_session", "item"])
Sample output:
+-----------+------+-----------+
|new_session| item| session_id|
+-----------+------+-----------+
| 1|item_1| 1 |
| 1|item_2| 2 |
| 0|item_3| 2 |
| 0|item_4| 2 |
| 1|item_1| 3 |
+-----------+------+-----------+
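Try this:
1. Create sample data that also contains rows with new_session == 2, to show that the approach works with more than two distinct new_session values.
2. Use the rows with new_session == 1 to generate the correct IDs.
3. Generate the final session_id values. The helper columns can safely be dropped at this point, leaving only session_id.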
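Take a look at this question; it is very similar. The session_id column can be implemented as a cumulative sum over new_session.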
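The cumulative-sum idea can be sketched as follows. This is a minimal illustration, not the code from the linked question: the function names `session_ids` and `add_session_id` and the `order_col` parameter are hypothetical, and the PySpark variant assumes the DataFrame already has a column that records the original row order (such as the `dummy` column created with `F.monotonically_increasing_id()`). Each row is first mapped to 1 if `new_session == 1` and 0 otherwise, so that other values such as 2 do not inflate the counter, and the running total of those indicators becomes the session ID.

```python
from itertools import accumulate


def session_ids(new_session_flags):
    """Pure-Python reference: running count of rows whose flag equals 1."""
    # Only a flag of exactly 1 starts a new session; any other value adds 0.
    indicators = (1 if flag == 1 else 0 for flag in new_session_flags)
    return list(accumulate(indicators))


def add_session_id(df, order_col):
    """PySpark variant (assumed sketch): windowed running sum of the indicator.

    `order_col` is a hypothetical parameter naming the column that preserves
    the original row order.
    """
    # Imported here so the pure-Python part above runs without PySpark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    starts = F.when(F.col("new_session") == 1, 1).otherwise(0)
    # Sum everything from the first row up to and including the current row.
    w = Window.orderBy(order_col).rowsBetween(
        Window.unboundedPreceding, Window.currentRow
    )
    return df.withColumn("session_id", F.sum(starts).over(w))


print(session_ids([1, 1, 0, 0, 1]))  # [1, 2, 2, 2, 3]
```

With PySpark available, `add_session_id(df, "dummy")` would be the corresponding call. A window without `partitionBy` moves all rows to a single partition, so this is best suited to data that is small or can be partitioned by some other key, such as a user ID.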
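from pyspark.sql import Window
from pyspark.sql import functions as F

# `spark` is an existing SparkSession.
df_tes = spark.createDataFrame([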
(1, "item_1"),
(1, "item_2"),
(0, "item_3"),
(2, "item_7"),
(0, "item_4"),
(2, "item_8"),
(1, "item_1")], ["new_session", "item"])
df_tes.show()
+-----------+------+
|new_session| item|
+-----------+------+
| 1|item_1|
| 1|item_2|
| 0|item_3|
| 2|item_7|
| 0|item_4|
| 2|item_8|
| 1|item_1|
+-----------+------+
# Create a `dummy` column so we know the original sort order for the rows. If
# you already have a column for this, you don't need to create the `dummy` column.
df = df_tes.withColumn('dummy', F.monotonically_increasing_id())
# Create the correct IDs using rows with `new_session == 1`. Note that we use
# `dummy` to keep the original order.
w_id = Window.partitionBy('new_session').orderBy('dummy')
df = df.withColumn('id_temp', F.row_number().over(w_id)).orderBy('dummy')
# Put all rows with new_session != 1 into the same group with 'used == 0'. This
# helps us to handle cases when there are more than two types of `new_session`.
df = df.withColumn('used', F.when(F.col('new_session')==1, 1).otherwise(0))
df.show()
+-----------+------+-----------+-------+----+
|new_session| item| dummy|id_temp|used|
+-----------+------+-----------+-------+----+
| 1|item_1| 8589934592| 1| 1|
| 1|item_2|17179869184| 2| 1|
| 0|item_3|25769803776| 1| 0|
| 2|item_7|34359738368| 1| 0|
| 0|item_4|42949672960| 2| 0|
| 2|item_8|51539607552| 2| 0|
| 1|item_1|60129542144| 3| 1|
+-----------+------+-----------+-------+----+
# For each row, use `lag` to fetch `id_temp` from the previous row, so the row
# immediately after a `new_session == 1` row picks up that row's ID. `lag`
# defines its own offset, so the window needs no explicit frame.
w = Window.orderBy('dummy')
df = df.withColumn('lag_id', F.lag('id_temp', 1, 1).over(w))
# For the rows with `used == 0` (i.e. `new_session != 1`), use `first` to apply
# that ID to every row in the group. Ordering by `dummy` makes `first`
# deterministic. Note that the whole group receives a single ID, so this relies
# on all such rows belonging to the same session, as in this sample.
df = df.withColumn('lag_id', F.first('lag_id').over(Window.partitionBy('used').orderBy('dummy')))
# Now use when/otherwise to set `session_id`.
df = df.withColumn('session_id', F.when(F.col('used') == 1, F.col('id_temp')).otherwise(F.col('lag_id')))
df.orderBy('dummy').show()
+-----------+------+-----------+-------+----+------+----------+
|new_session| item| dummy|id_temp|used|lag_id|session_id|
+-----------+------+-----------+-------+----+------+----------+
| 1|item_1| 8589934592| 1| 1| 1| 1|
| 1|item_2|17179869184| 2| 1| 1| 2|
| 0|item_3|25769803776| 1| 0| 2| 2|
| 2|item_7|34359738368| 1| 0| 2| 2|
| 0|item_4|42949672960| 2| 0| 2| 2|
| 2|item_8|51539607552| 2| 0| 2| 2|
| 1|item_1|60129542144| 3| 1| 1| 3|
+-----------+------+-----------+-------+----+------+----------+