Python: increment a counter column when a specific value is encountered in another column

Tags: python, pyspark

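I have a PySpark DataFrame with a column new_session whose values are 1 or 0. I want to create another column (session_id) that holds a counter which increments whenever the value of new_session is 1.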

Sample input:

df_tes = spark_session.createDataFrame([
    (1, "item_1"),
    (1, "item_2"),
    (0, "item_3"),
    (0, "item_4"), 
    (1, "item_1")], ["new_session", "item"])
Sample output:

+-----------+------+-----------+
|new_session|  item| session_id|
+-----------+------+-----------+
|          1|item_1|   1       |
|          1|item_2|   2       |
|          0|item_3|   2       |
|          0|item_4|   2       |
|          1|item_1|   3       |
+-----------+------+-----------+
Try this:

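  • First create the DataFrame. Note that I added two rows with new_session == 2 to show that the algorithm works with more than two kinds of new_session value.
  • Prepare the data: we want to record the existing sort order and use the rows with new_session == 1 to generate the correct IDs.
  • Generate the final session_id values.
  • You can safely drop the helper columns at this point and keep session_id.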

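The last step above (dropping the helper columns) is not shown in the answer's code. A minimal sketch of it follows, assuming the helper columns are the ones the step-by-step code creates (`dummy`, `id_temp`, `used`, `lag_id`); the function name `drop_helper_columns` is hypothetical and not part of the original answer.

```python
# Helper columns created by the step-by-step code in this answer.
HELPER_COLUMNS = ("dummy", "id_temp", "used", "lag_id")


def drop_helper_columns(df, helpers=HELPER_COLUMNS):
    """Return `df` without the intermediate columns (hypothetical helper)."""
    # DataFrame.drop accepts several column names at once and is a no-op
    # for names that are not present in the schema.
    return df.drop(*helpers)
```

If the row order matters for display, sorting first, for example `drop_helper_columns(df.orderBy('dummy'))`, should leave only `new_session`, `item` and `session_id`.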
Take a look at this question; it is very similar. Here is how to implement a cumulative-sum column:
    +-----------+------+-----------+
    |new_session|  item| session_id|
    +-----------+------+-----------+
    |          1|item_1|   1       |
    |          1|item_2|   2       |
    |          0|item_3|   2       |
    |          0|item_4|   2       |
    |          1|item_1|   3       |
    +-----------+------+-----------+
    
    from pyspark.sql import Window, functions as F

    df_tes = spark.createDataFrame([
        (1, "item_1"),
        (1, "item_2"),
        (0, "item_3"),
        (2, "item_7"),    
        (0, "item_4"), 
        (2, "item_8"), 
        (1, "item_1")], ["new_session", "item"])
    df_tes.show()
    
    +-----------+------+
    |new_session|  item|
    +-----------+------+
    |          1|item_1|
    |          1|item_2|
    |          0|item_3|
    |          2|item_7|
    |          0|item_4|
    |          2|item_8|
    |          1|item_1|
    +-----------+------+
    
    # Create a `dummy` column so we know the original sort order for the rows. If 
    # you already have a column for this, you don't need to create the `dummy` column.
    
    df = df_tes.withColumn('dummy', F.monotonically_increasing_id())
    
    # Create the correct IDs using rows with `new_session == 1`, note we use `dummy` to keep the original order.
    
    df = df.withColumn('id_temp', F.row_number().over(Window.partitionBy('new_session').orderBy('dummy'))).orderBy('dummy')
    
    # Put all rows with new_session != 1 into the same group with 'used == 0'. This 
    # helps us to handle cases when there are more than two types of `new_session`. 
    
    df = df.withColumn('used', F.when(F.col('new_session')==1, 1).otherwise(0))
    df.show()
    
    +-----------+------+-----------+-------+----+
    |new_session|  item|      dummy|id_temp|used|
    +-----------+------+-----------+-------+----+
    |          1|item_1| 8589934592|      1|   1|
    |          1|item_2|17179869184|      2|   1|
    |          0|item_3|25769803776|      1|   0|
    |          2|item_7|34359738368|      1|   0|
    |          0|item_4|42949672960|      2|   0|
    |          2|item_8|51539607552|      2|   0|
    |          1|item_1|60129542144|      3|   1|
    +-----------+------+-----------+-------+----+
    
    
    # First, for each row, we use `lag` function to get the `id` from its previous 
    # row. So for one row next to a row of `new_session==1`, it'll pick its correct ID here.
    
    w = Window.orderBy("dummy").rowsBetween(-1, -1)
    df = df.withColumn('lag_id', F.lag('id_temp', 1, 1).over(w))
    
    # For the rows with `used==0` (i.e. `new_session!=1`), use `first` to apply the 
    # correct IDs to all rows.
    
    df = df.withColumn('lag_id', F.first('lag_id').over(Window.partitionBy('used')))
    
    # Now we can use if-else to set the `session_id` properly.
    
    df = df.withColumn('session_id', F.when(F.col('used')==1, F.col('id_temp')).otherwise(F.col('lag_id')))
    df.orderBy('dummy').show()
    
    +-----------+------+-----------+-------+----+------+----------+
    |new_session|  item|      dummy|id_temp|used|lag_id|session_id|
    +-----------+------+-----------+-------+----+------+----------+
    |          1|item_1| 8589934592|      1|   1|     1|         1|
    |          1|item_2|17179869184|      2|   1|     1|         2|
    |          0|item_3|25769803776|      1|   0|     2|         2|
    |          2|item_7|34359738368|      1|   0|     2|         2|
    |          0|item_4|42949672960|      2|   0|     2|         2|
    |          2|item_8|51539607552|      2|   0|     2|         2|
    |          1|item_1|60129542144|      3|   1|     1|         3|
    +-----------+------+-----------+-------+----+------+----------+
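
The cumulative-sum approach mentioned in the second answer is not spelled out on this page, so here is a minimal sketch of it under stated assumptions. It treats `session_id` as a running count of rows where `new_session == 1`, ordered by a column that preserves the original row order (the `dummy` column above). The names `running_session_ids` and `add_session_id` are hypothetical. Note that the step-by-step code applies `first` over `partitionBy('used')`, which gives every `used == 0` row the same ID, so it appears to assume a single contiguous run of such rows; a running sum does not need that assumption.

```python
from itertools import accumulate


def running_session_ids(flags, start_value=1):
    """Plain-Python reference: running count of flags equal to `start_value`."""
    return list(accumulate(1 if flag == start_value else 0 for flag in flags))


def add_session_id(df, order_col="dummy", flag_col="new_session"):
    """PySpark version of the same running sum (hypothetical helper).

    Assumes `order_col` already encodes the original row order, like the
    `dummy` column built with `monotonically_increasing_id()`.
    """
    # Imported lazily so the plain-Python reference runs without Spark installed.
    from pyspark.sql import Window, functions as F

    # Frame: every row from the start of the ordering up to the current row.
    w = Window.orderBy(order_col).rowsBetween(
        Window.unboundedPreceding, Window.currentRow
    )
    # 1 where a new session starts, 0 elsewhere, so values such as 2 are ignored.
    is_start = F.when(F.col(flag_col) == 1, 1).otherwise(0)
    return df.withColumn("session_id", F.sum(is_start).over(w))
```

For the question's input, `running_session_ids([1, 1, 0, 0, 1])` gives `[1, 2, 2, 2, 3]`, matching the sample output. As with the window used in the answer above, a window with `orderBy` but no `partitionBy` makes Spark process all rows in a single partition, so this sketch is best suited to data that fits comfortably on one executor or that can be partitioned by some other key first.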