Python: how do I group timestamps by tag?
I have a DataFrame (Series) indexed by a DatetimeIndex:
tag
2015-08-21 16:32:00 stationary
2015-08-21 16:33:00 automotive
2015-08-21 16:34:00 automotive
2015-08-21 17:27:00 stationary
2015-08-21 17:28:00 stationary
2015-08-21 17:29:00 stationary
2015-08-21 17:30:00 stationary
2015-08-21 17:31:00 stationary
2015-08-21 17:32:00 stationary
2015-08-24 16:55:00 automotive
2015-08-24 16:56:00 automotive
2015-08-24 16:57:00 automotive
2015-08-24 16:58:00 automotive
2015-08-24 16:59:00 stationary
2015-08-24 17:00:00 stationary
2015-08-24 17:01:00 stationary
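For anyone who wants to reproduce the frame above, here is a minimal sketch of one way to build it. The variable names `rows` and `df` are assumptions for illustration; only the timestamps and tags come from the listing.

```python
import pandas as pd

# Timestamps and tags copied from the listing above.
rows = [
    ("2015-08-21 16:32:00", "stationary"),
    ("2015-08-21 16:33:00", "automotive"),
    ("2015-08-21 16:34:00", "automotive"),
    ("2015-08-21 17:27:00", "stationary"),
    ("2015-08-21 17:28:00", "stationary"),
    ("2015-08-21 17:29:00", "stationary"),
    ("2015-08-21 17:30:00", "stationary"),
    ("2015-08-21 17:31:00", "stationary"),
    ("2015-08-21 17:32:00", "stationary"),
    ("2015-08-24 16:55:00", "automotive"),
    ("2015-08-24 16:56:00", "automotive"),
    ("2015-08-24 16:57:00", "automotive"),
    ("2015-08-24 16:58:00", "automotive"),
    ("2015-08-24 16:59:00", "stationary"),
    ("2015-08-24 17:00:00", "stationary"),
    ("2015-08-24 17:01:00", "stationary"),
]

# pd.to_datetime on a list of strings returns a DatetimeIndex.
index = pd.to_datetime([ts for ts, _ in rows])
df = pd.DataFrame({"tag": [tag for _, tag in rows]}, index=index)
print(df.head())
```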
I would like to group by tag and aggregate the time index, so the expected result is
Start End Tag
- 2015-08-21 16:32:00 stationary
2015-08-21 16:34:00 2015-08-21 16:34:00 automotive
2015-08-21 17:27:00 2015-08-21 17:32:00 stationary
2015-08-24 16:55:00 2015-08-24 16:58:00 automotive
2015-08-24 16:59:00 2015-08-24 17:01:00 stationary
I am not sure your expected result is correct. Here is what I believe the correct expected result is:
begin end tag
0 2015-08-21 16:32:00 2015-08-21 16:32:00 stationary
1 2015-08-21 16:33:00 2015-08-21 16:34:00 automotive
3 2015-08-21 17:27:00 2015-08-21 17:32:00 stationary
9 2015-08-24 16:55:00 2015-08-24 16:58:00 automotive
13 2015-08-24 16:59:00 2015-08-24 17:01:00 stationary
Here is how to get it:
import pandas as pd
import numpy as np
from datetime import datetime
# Prepare data from your example
data = [
("2015-08-21 16:32:00", "stationary"),
("2015-08-21 16:33:00", "automotive"),
("2015-08-21 16:34:00", "automotive"),
("2015-08-21 17:27:00", "stationary"),
("2015-08-21 17:28:00", "stationary"),
("2015-08-21 17:29:00", "stationary"),
("2015-08-21 17:30:00", "stationary"),
("2015-08-21 17:31:00", "stationary"),
("2015-08-21 17:32:00", "stationary"),
("2015-08-24 16:55:00", "automotive"),
("2015-08-24 16:56:00", "automotive"),
("2015-08-24 16:57:00", "automotive"),
("2015-08-24 16:58:00", "automotive"),
("2015-08-24 16:59:00", "stationary"),
("2015-08-24 17:00:00", "stationary"),
("2015-08-24 17:01:00", "stationary")]
data = [(datetime.strptime(x[0], "%Y-%m-%d %H:%M:%S"), x[1]) for x in data]
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
df = pd.DataFrame(data, columns=['ts', 'tag']).sort_values('ts')
df['is_first'] = df.tag != df.tag.shift()
df['is_last'] = df.tag != df.tag.shift(-1)
# Fill begin timestamp, only on first occurrences
df['begin'] = df.ts
df.loc[~df.is_first, 'begin'] = pd.NaT
# Fill end timestamp, only on last occurrences
df['end'] = df.ts
df.loc[~df.is_last, 'end'] = pd.NaT
# Fill NaT with next end
df['end'] = df['end'].bfill()
# Restrict to changes
df = df[df.is_first]
# Remove useless columns
df = df[['begin', 'end', 'tag']].sort_values('begin')
You can use a groupby and apply approach:
def func(group):
    return pd.Series({'Start': group.index[0], 'End': group.index[-1], 'Tag': group['tag'].values[0]})
df.groupby((df.shift(1) != df).cumsum()['tag'], as_index=False).apply(func)
End Start Tag
0 2015-08-21 16:32:00 2015-08-21 16:32:00 stationary
1 2015-08-21 16:34:00 2015-08-21 16:33:00 automotive
2 2015-08-21 17:32:00 2015-08-21 17:27:00 stationary
3 2015-08-24 16:58:00 2015-08-24 16:55:00 automotive
4 2015-08-24 17:01:00 2015-08-24 16:59:00 stationary
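The grouping key in the call above relies on a shift-and-cumsum trick that is easy to miss. The following is a small sketch of that idea on a toy Series; the names `tag`, `changed` and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

tag = pd.Series(["stationary", "automotive", "automotive", "stationary"])

# True wherever the tag differs from the previous row.
# The first row is compared against NaN, so it always counts as a change.
changed = tag != tag.shift()

# The running total of change flags gives each consecutive run its own id.
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 2, 2, 3]
```

Rows that share a run id belong to the same uninterrupted stretch of one tag, which is why the second "stationary" run ends up in a different group from the first.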
Very elegant. However, it seems to run really slowly when tested on a large DataFrame: 9 seconds on 10000 rows, versus 90 milliseconds with my answer.
@Omar Agreed. This particular groupby operation iterates over too many sub-groups, and constructing a new pd.Series in each group introduces a lot of overhead if the function is called too many times (that is, if the tag changes too often). Your approach is based entirely on vectorized numpy.array operations and is much faster in this case.
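One possible middle ground, not taken from either answer, is to keep the run-id grouping but replace the Python-level apply with the built-in first and last aggregations. The sketch below assumes a frame with a sorted DatetimeIndex and a 'tag' column; the function name `summarize_runs` is hypothetical, and no timings are claimed for it here.

```python
import pandas as pd

def summarize_runs(df):
    """Collapse consecutive rows with the same tag into (Start, End, Tag) rows.

    Assumes df has a sorted DatetimeIndex and a 'tag' column.
    """
    # One integer id per uninterrupted run of the same tag.
    run_id = (df["tag"] != df["tag"].shift()).cumsum().values
    ts = df.index.to_series()
    # Built-in aggregations avoid constructing a pd.Series per group.
    return pd.DataFrame({
        "Start": ts.groupby(run_id).first(),
        "End": ts.groupby(run_id).last(),
        "Tag": df["tag"].groupby(run_id).first(),
    }).reset_index(drop=True)

# Small usage example built from the first four rows of the question's data.
idx = pd.to_datetime([
    "2015-08-21 16:32:00",
    "2015-08-21 16:33:00",
    "2015-08-21 16:34:00",
    "2015-08-21 17:27:00",
])
small = pd.DataFrame(
    {"tag": ["stationary", "automotive", "automotive", "stationary"]},
    index=idx,
)
result = summarize_runs(small)
print(result)
```

Whether this is fast enough depends on the data, so it is worth timing against both answers on a realistic frame before adopting it.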