Python: problem creating a new id column based on three conditions?
I have a dataframe with conversations and timestamps, as shown below:
timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...
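At the moment I want to identify the conversations inside the dataframe with an id. The problem is that a user can have several conversations (i.e. one userID can be associated with several textBlob values). So I would like to add a new_id that lets me tell the conversations in the dataframe above apart.
To do this, I would like to create the new_id column based on three criteria: the time gap between messages (10 minutes), the userID, and the restart keyword in textBlob.
So far, I have tried: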
searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))
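However, I also need to take the userID into account, and I end up with several columns. Is there any way to satisfy all three conditions and get the expected output?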
To put it all together, build a boolean mask for each condition, combine the masks, then convert the result to int and take its cumulative sum:
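import pandas as pd

# Assumes df.timestamp is already datetime64 and df.userID is numeric.
mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm')      # more than 10 minutes since the previous row
mask2 = df['userID'].diff() != 0                         # userID changed (also True on the first row)
mask3 = df['textBlob'].shift().str.lower() == 'restart'  # the previous row was a restart message
df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()
# Result:
print(df.to_string(index=False))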
timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10
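The mask-and-cumsum idea can be tried on a handful of rows. The following is a minimal sketch rather than the answerer's exact code: the sample rows and the helper name `assign_conversation_ids` are made up for illustration, and it compares `userID` with `shift()` instead of using `diff()`, on the assumption that the real IDs may be strings such as '01'.

```python
import pandas as pd

def assign_conversation_ids(df, gap=pd.Timedelta(minutes=10)):
    """Hypothetical helper: number conversations with boolean masks and cumsum."""
    ts = pd.to_datetime(df["timestamp"])
    # More than `gap` has passed since the previous row.
    long_gap = ts.diff() > gap
    # userID differs from the previous row (True on the first row).
    new_user = df["userID"].ne(df["userID"].shift())
    # The previous row was a restart message, in any letter case.
    after_restart = df["textBlob"].str.lower().eq("restart").shift(fill_value=False)
    return (long_gap | new_user | after_restart).astype(int).cumsum()

# Made-up sample rows in the shape of the question's data.
sample = pd.DataFrame({
    "timestamp": ["2018-10-05 23:07:02", "2018-10-05 23:10:23",
                  "2018-10-05 23:10:59", "2018-10-08 03:11:32",
                  "2018-10-08 03:38:36"],
    "userID": ["01", "01", "01", "02", "02"],
    "textBlob": ["text", "restart", "text", "text", "text"],
})
sample["new_id"] = assign_conversation_ids(sample)
print(sample["new_id"].tolist())  # [1, 1, 2, 3, 4]
```

Comparing with `shift()` works for both numeric and string IDs, whereas `diff()` needs a numeric column.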
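OK, I think the 10 minutes should be counted from the start of the conversation rather than between consecutive messages. In that case you need to iterate over the rows, like this: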
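import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart', 'Restart']))
# True where the userID is the same as in the previous row
user_change = df.userID == df.userID.shift().bfill()
# Provisional id that increases on every restart or change of user
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]
final_ids = []
for new_id, timestamp in zip(df.new_id, df.timestamp):
    timedelta = timestamp - start_time
    # New conversation: the provisional id changed, or more than
    # 10 minutes have passed since the conversation started
    if new_id != new_id_prev or timedelta > pd.Timedelta(10, unit='m'):
        current_id += 1
        start_time = timestamp
    new_id_prev = new_id
    final_ids.append(current_id)
# Assign once at the end instead of writing through df.new_id.iloc[i]
df['new_id'] = final_ids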
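To see how counting from the conversation start differs from counting gaps between consecutive messages, here is a small self-contained sketch. The helper name `ids_from_conversation_start` and the sample rows are hypothetical, and unlike the loop above it numbers conversations from 1. As in that loop, a restart message itself opens the new conversation.

```python
import pandas as pd

def ids_from_conversation_start(df, window=pd.Timedelta(minutes=10)):
    """Hypothetical helper: open a new conversation when the user changes,
    a restart message appears, or `window` has elapsed since the
    current conversation began."""
    ts = pd.to_datetime(df["timestamp"])
    is_restart = df["textBlob"].str.lower().eq("restart")
    ids = []
    current_id = 0
    start_time = None
    prev_user = None
    for user, restart, t in zip(df["userID"], is_restart, ts):
        if (start_time is None or user != prev_user or restart
                or t - start_time > window):
            current_id += 1
            start_time = t  # the window is measured from this row
        prev_user = user
        ids.append(current_id)
    return ids

# Every gap below is under 10 minutes, but the last message arrives
# 11 min 42 s after the first one, so it opens a second conversation.
sample = pd.DataFrame({
    "timestamp": ["2018-10-08 05:58:18", "2018-10-08 06:02:03",
                  "2018-10-08 06:06:03", "2018-10-08 06:10:00"],
    "userID": ["04", "04", "04", "04"],
    "textBlob": ["text", "text", "text", "text"],
})
sample["new_id"] = ids_from_conversation_start(sample)
print(sample["new_id"].tolist())  # [1, 1, 1, 2]
```

A gap-based mask would keep all four rows in one conversation, which is why this variant needs a loop: each row's id depends on when the current conversation started.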
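Since userID changes from 01 to 02, shouldn't a new id be assigned between the rows 2018-10-05 23:11:03 and 2018-10-08 03:11:32? Also, why does the new id jump from 005 to 007?
Thanks for the help, @PeterLeimbigler. No, I made a mistake when generating the data. I have fixed it again, thanks.
Now it jumps from 008 to 010, and the rows I mentioned above still do not get an id increment for the userID change.
@tumbleweed, I have adjusted the restart logic, so this output should be what you are looking for. Let me know if I am still missing anything.
Thanks for your help!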