Python: trouble creating a new id column based on three conditions?


I have a dataframe with conversations and timestamps, like this:

timestamp   userID  textBlob    new_id
2018-10-05 23:07:02 01  a large text blob...
2018-10-05 23:07:13 01  a large text blob...
2018-10-05 23:07:23 01  a large text blob...
2018-10-05 23:07:36 01  a large text blob...
2018-10-05 23:08:02 01  a large text blob...
2018-10-05 23:09:16 01  a large text blob...
2018-10-05 23:09:21 01  a large text blob...
2018-10-05 23:09:39 01  a large text blob...
2018-10-05 23:09:47 01  a large text blob...
2018-10-05 23:10:01 01  a large text blob...
2018-10-05 23:10:11 01  a large text blob...
2018-10-05 23:10:23 01  restart             
2018-10-05 23:10:59 01  a large text blob...
2018-10-05 23:11:03 01  a large text blob...
2018-10-08 23:11:32 02  a large text blob...
2018-10-08 23:12:58 02  a large text blob...
2018-10-08 23:13:16 02  a large text blob...
2018-10-08 23:14:04 02  a large text blob...
2018-10-08 03:38:36 02  a large text blob...
2018-10-08 03:38:42 02  a large text blob...
2018-10-08 03:38:52 02  a large text blob...
2018-10-08 03:38:57 02  a large text blob...
2018-10-08 03:39:10 02  a large text blob...
2018-10-08 03:39:27 02  Restart             
2018-10-08 03:40:47 02  a large text blob...
2018-10-08 03:40:54 02  a large text blob...
2018-10-08 03:41:02 02  a large text blob...
2018-10-08 03:41:12 02  a large text blob...
2018-10-08 03:41:32 02  a large text blob...
2018-10-08 03:41:39 02  a large text blob...
2018-10-08 03:42:20 02  a large text blob...
2018-10-08 03:44:58 02  a large text blob...
2018-10-08 03:45:54 02  a large text blob...
2018-10-08 03:46:06 02  a large text blob...
2018-10-08 05:06:42 03  a large text blob...
2018-10-08 05:06:53 03  a large text blob...
2018-10-08 05:08:49 03  a large text blob...
2018-10-08 05:08:58 03  a large text blob...
2018-10-08 05:58:18 04  a large text blob...
2018-10-08 05:58:26 04  a large text blob...
2018-10-08 05:58:37 04  a large text blob...
2018-10-08 05:58:58 04  a large text blob...
2018-10-08 06:00:31 04  a large text blob...
2018-10-08 06:01:00 04  a large text blob...
2018-10-08 06:01:14 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:02:03 04  a large text blob...
2018-10-08 06:06:03 04  a large text blob...
2018-10-08 06:10:00 04  a large text blob...
2018-10-08 09:07:03 04  a large text blob...
2018-10-08 09:09:03 04  a large text blob...
2018-10-09 10:01:00 04  a large text blob...
2018-10-09 10:02:00 04  a large text blob...
2018-10-09 10:03:00 04  a large text blob...
2018-10-09 10:09:00 04  a large text blob...
2018-10-09 10:09:00 05  a large text blob...
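Currently, I would like to identify the conversations inside the dataframe with an id. The problem is that one user can have several conversations (that is, one userID can be associated with several textBlob values). So I would like to add a new_id that lets me identify the conversations in the dataframe above.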

To do this, I would like to create the new_id column based on three criteria:

  • A 10-minute period
  • The occurrence of a keyword
  • The user having no more text blobs

I refer to the expected output as (*).

So far I have tried:

    searchfor = ['restart','Restart']
    df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))
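
As a side note on the attempt above, here is a minimal sketch, assuming the keyword should match regardless of capitalization. str.contains accepts case=False, so both spellings do not have to be listed. The sample frame is made up for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical sample data, not the real conversations
sample = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "textBlob": ["hello", "restart", "Restart", "some text"],
})

# case=False matches 'restart', 'Restart', 'RESTART', ...
# na=False treats missing text as "no keyword"
sample["keyword_id"] = sample["textBlob"].str.contains("restart", case=False, na=False)

print(sample["keyword_id"].tolist())
```

This only flags the keyword rows; it does not yet take the userID or the time gap into account.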
    


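However, I also need to take the userID into account, and I end up with several columns. Is there a way to satisfy all three conditions and get the expected output (*)?

To put it all together, build a boolean mask for each condition, combine the masks with OR, convert the result to int and take its cumulative sum: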

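    import pandas as pd

    # More than 10 minutes since the previous message (timestamp must be datetime)
    mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm')
    # userID differs from the previous row (needs a numeric userID; also True on row 0)
    mask2 = df['userID'].diff() != 0
    # The previous message was the keyword, in any capitalization
    mask3 = df['textBlob'].shift().str.lower() == 'restart'

    df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()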
    
    # Result:
    print(df.to_string(index=False))
    
    timestamp  userID              textBlob  new_id
    2018-10-05 23:07:02       1  a_large_text_blob...       1
    2018-10-05 23:07:13       1  a_large_text_blob...       1
    2018-10-05 23:07:23       1  a_large_text_blob...       1
    2018-10-05 23:07:36       1  a_large_text_blob...       1
    2018-10-05 23:08:02       1  a_large_text_blob...       1
    2018-10-05 23:09:16       1  a_large_text_blob...       1
    2018-10-05 23:09:21       1  a_large_text_blob...       1
    2018-10-05 23:09:39       1  a_large_text_blob...       1
    2018-10-05 23:09:47       1  a_large_text_blob...       1
    2018-10-05 23:10:01       1  a_large_text_blob...       1
    2018-10-05 23:10:11       1  a_large_text_blob...       1
    2018-10-05 23:10:23       1               restart       1
    2018-10-05 23:10:59       1  a_large_text_blob...       2
    2018-10-05 23:11:03       1  a_large_text_blob...       2
    2018-10-08 03:11:32       2  a_large_text_blob...       3
    2018-10-08 03:12:58       2  a_large_text_blob...       3
    2018-10-08 03:13:16       2  a_large_text_blob...       3
    2018-10-08 03:14:04       2  a_large_text_blob...       3
    2018-10-08 03:38:36       2  a_large_text_blob...       4
    2018-10-08 03:38:42       2  a_large_text_blob...       4
    2018-10-08 03:38:52       2  a_large_text_blob...       4
    2018-10-08 03:38:57       2  a_large_text_blob...       4
    2018-10-08 03:39:10       2  a_large_text_blob...       4
    2018-10-08 03:39:27       2               Restart       4
    2018-10-08 03:40:47       2  a_large_text_blob...       5
    2018-10-08 03:40:54       2  a_large_text_blob...       5
    2018-10-08 03:41:02       2  a_large_text_blob...       5
    2018-10-08 03:41:12       2  a_large_text_blob...       5
    2018-10-08 03:41:32       2  a_large_text_blob...       5
    2018-10-08 03:41:39       2  a_large_text_blob...       5
    2018-10-08 03:42:20       2  a_large_text_blob...       5
    2018-10-08 03:44:58       2  a_large_text_blob...       5
    2018-10-08 03:45:54       2  a_large_text_blob...       5
    2018-10-08 03:46:06       2  a_large_text_blob...       5
    2018-10-08 05:06:42       3  a_large_text_blob...       6
    2018-10-08 05:06:53       3  a_large_text_blob...       6
    2018-10-08 05:08:49       3  a_large_text_blob...       6
    2018-10-08 05:08:58       3  a_large_text_blob...       6
    2018-10-08 05:58:18       4  a_large_text_blob...       7
    2018-10-08 05:58:26       4  a_large_text_blob...       7
    2018-10-08 05:58:37       4  a_large_text_blob...       7
    2018-10-08 05:58:58       4  a_large_text_blob...       7
    2018-10-08 06:00:31       4  a_large_text_blob...       7
    2018-10-08 06:01:00       4  a_large_text_blob...       7
    2018-10-08 06:01:14       4  a_large_text_blob...       7
    2018-10-08 06:02:03       4  a_large_text_blob...       7
    2018-10-08 06:02:03       4  a_large_text_blob...       7
    2018-10-08 06:06:03       4  a_large_text_blob...       7
    2018-10-08 06:10:00       4  a_large_text_blob...       7
    2018-10-08 09:07:03       4  a_large_text_blob...       8
    2018-10-08 09:09:03       4  a_large_text_blob...       8
    2018-10-09 10:01:00       4  a_large_text_blob...       9
    2018-10-09 10:02:00       4  a_large_text_blob...       9
    2018-10-09 10:03:00       4  a_large_text_blob...       9
    2018-10-09 10:09:00       4  a_large_text_blob...       9
    2018-10-09 10:09:00       5  a_large_text_blob...      10
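
The mask approach can also be tried on a small self-contained frame. The sketch below is an assumption-laden variant rather than the answer's exact code: it keeps userID as strings such as '01' (as in the question's table), so it compares each row with the previous one via shift() instead of diff(), which would likely fail on strings. The sample rows are a hypothetical miniature of the data.

```python
import pandas as pd

# Hypothetical miniature of the conversation log
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-05 23:07:02",
        "2018-10-05 23:07:13",
        "2018-10-05 23:10:23",
        "2018-10-05 23:10:59",
        "2018-10-08 03:11:32",
        "2018-10-08 03:38:36",
        "2018-10-08 03:38:42",
    ]),
    "userID": ["01", "01", "01", "01", "02", "02", "02"],
    "textBlob": ["text", "text", "restart", "text", "text", "text", "text"],
})

# More than 10 minutes since the previous message
gap = df["timestamp"].diff() > pd.Timedelta(minutes=10)
# userID differs from the previous row (also True on the first row)
new_user = df["userID"] != df["userID"].shift()
# The previous message was the keyword, in any capitalization
after_restart = df["textBlob"].shift().str.lower() == "restart"

# Each True marks the first row of a new conversation
df["new_id"] = (gap | new_user | after_restart).astype(int).cumsum()
print(df["new_id"].tolist())  # [1, 1, 1, 2, 3, 4, 4]
```

Because the user-change mask is True on the first row, the ids start at 1.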
    

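OK, I assume the 10-minute period should be measured from the start of the conversation rather than between consecutive messages. In that case you need to iterate over the rows, like this: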

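    import pandas as pd

    df['timestamp'] = pd.to_datetime(df['timestamp'])
    # True on rows whose text contains the keyword, in either capitalization
    restart = df.textBlob.str.contains('|'.join(['restart', 'Restart']))
    # True where the row has the same userID as the previous row
    same_user = df.userID == df.userID.shift().bfill()
    # Provisional id: increments on every keyword row and every user change
    df['new_id'] = (restart | ~same_user).cumsum()
    current_id = 0
    new_id_prev = 0
    start_time = df.timestamp.iloc[0]
    new_ids = []

    for new_id, timestamp in zip(df.new_id, df.timestamp):
        timedelta = timestamp - start_time

        # New conversation: provisional id changed, or 10 minutes since it started
        if new_id != new_id_prev or timedelta > pd.Timedelta(10, unit='m'):
            current_id += 1
            start_time = timestamp

        new_id_prev = new_id
        new_ids.append(current_id)

    # Assign once at the end instead of writing through df.new_id.iloc[i]
    df['new_id'] = new_ids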
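
To see how measuring the window from the conversation start differs from measuring the gap between messages, here is a minimal sketch. The helper assign_conversation_ids and the demo frame are hypothetical, and the sketch assumes a keyword row closes its conversation, so the next row starts a new one.

```python
import pandas as pd

def assign_conversation_ids(df, window=pd.Timedelta(minutes=10)):
    """Return one conversation id per row of df (hypothetical helper)."""
    ids = []
    current_id = 0
    start_time = None
    prev_user = None
    prev_was_restart = False

    for user, text, ts in zip(df["userID"], df["textBlob"], df["timestamp"]):
        starts_new = (
            start_time is None              # very first row
            or user != prev_user            # a different user
            or prev_was_restart             # previous message was the keyword
            or ts - start_time > window     # window measured from conversation start
        )
        if starts_new:
            current_id += 1
            start_time = ts

        ids.append(current_id)
        prev_user = user
        prev_was_restart = str(text).lower() == "restart"

    return ids

# Made-up rows: messages 6 minutes apart, then a different user
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-10-08 06:00:00",
        "2018-10-08 06:06:00",
        "2018-10-08 06:12:00",
        "2018-10-08 06:13:00",
    ]),
    "userID": [4, 4, 4, 5],
    "textBlob": ["text", "text", "text", "text"],
})
demo["new_id"] = assign_conversation_ids(demo)
print(demo["new_id"].tolist())  # [1, 1, 2, 3]
```

No gap between consecutive messages exceeds 10 minutes here, so a gap-based mask would keep the first three rows together. The third row is 12 minutes after the conversation started, which is why it gets a new id in this variant.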
    

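Since the userID changes from 01 to 02, shouldn't a new id be assigned between the rows 2018-10-05 23:11:03 and 2018-10-08 03:11:32? Also, why does the new id jump from 005 to 007?

Thanks for the help, @PeterLeimbigler. No, I made a mistake when generating the data. I have fixed it again, thanks.

Now it jumps from 008 to 010, and the rows I mentioned above still have no increment for the userID change.

@tumbleweed, I have adjusted the restart logic, so this output should be what you are looking for. Let me know if I am still missing something.

Thanks for your help!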