Python 3.x: Reformatting a DataFrame to show a sequence number and time difference after groupby

I have a DataFrame with an identifier, a sequence number, and a timestamp.

For example:

MyIndex     seq_no    timestamp
1          181        7:56
1          182        7:57
1          183        7:59
2          184        8:01
2          185        8:04
3          186        8:05
3          187        8:08
3          188        8:10

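I would like to reformat it so that, for each index, it shows a sequence number restarting at 1 and the time difference (in minutes) from the previous row, something like: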

MyIndex     seq_no    timediff
1          1        0
1          2        1
1          3        2
2          1        0
2          2        3
3          1        0
3          2        3
3          3        2

I know I can get the sequence number like this:

df.groupby("MyIndex")["seq_no"].rank(method="first", ascending=True)

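But how do I get the time difference? Bonus points if you show me how to compute both the time difference between consecutive steps and the total time difference from the start.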

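I think the easiest way to get the time difference is to convert the timestamp to a single unit (minutes here). You can then compute the difference with groupby and shift: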

import pandas as pd
from io import StringIO

data = """Index     seq_no    timestamp
1          181        7:56
1          182        7:57
1          183        7:59
2          184        8:01
2          185        8:04
3          186        8:05
3          187        8:08
3          188        8:10"""

df = pd.read_csv(StringIO(data), sep=r'\s+')

# use cumcount to get new seq_no
df['seq_no_new'] = df.groupby('Index').cumcount() + 1

# can convert timestamp by splitting string
# and then casting to int
time = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = time.iloc[:, 0] * 60 + time.iloc[:, 1]

# you then calculate the difference with groupby/shift
# fillna values with 0 and cast to int
df['timediff'] = (df['time'] - df.groupby('Index')['time'].shift(1)).fillna(0).astype(int)

# pick columns you want at the end
df = df.loc[:, ['Index', 'seq_no_new', 'timediff']]

Output:

>>> df

   Index  seq_no_new  timediff
0      1           1         0
1      1           2         1
2      1           3         2
3      2           1         0
4      2           2         3
5      3           1         0
6      3           2         3
7      3           3         2
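For the bonus part of the question (total time elapsed since the first row of each group), here is a minimal sketch, assuming the same sample data and the same conversion to minutes as above. The column names `step_diff` and `total_diff` are illustrative and do not come from the original post.

```python
import pandas as pd
from io import StringIO

data = """Index     seq_no    timestamp
1          181        7:56
1          182        7:57
1          183        7:59
2          184        8:01
2          185        8:04
3          186        8:05
3          187        8:08
3          188        8:10"""

df = pd.read_csv(StringIO(data), sep=r'\s+')

# convert "h:mm" strings to minutes since midnight
parts = df['timestamp'].str.split(':', expand=True).astype(int)
df['time'] = parts[0] * 60 + parts[1]

grouped = df.groupby('Index')['time']

# difference from the previous row within each group (first row -> 0)
step_diff = grouped.diff().fillna(0).astype(int)

# difference from the first row of each group
total_diff = df['time'] - grouped.transform('first')

df['step_diff'] = step_diff
df['total_diff'] = total_diff
```

`diff()` on the grouped column is equivalent to subtracting `shift(1)`, and `transform('first')` broadcasts each group's first value back to every row of that group, so the subtraction gives the elapsed minutes since the start of the group.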

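Is Index a column, or is it actually your index? Sorry for the confusion, I will rename it. Also, is your timestamp in hh:mm or mm:ss format?

It is a full datetime stamp: dd-mm-yyy-hh:mm:s. I can get the time difference from the initial value by shifting the values of MyIndex.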
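Since the comment above says the real column holds full datetime stamps, the same idea could be sketched with `pd.to_datetime` rather than string splitting. The sample values and the format string `%d-%m-%Y %H:%M:%S` below are assumptions (the exact format in the comment is cut off), so adjust them to the actual data.

```python
import pandas as pd

# hypothetical sample data; the real timestamp format may differ
df = pd.DataFrame({
    'MyIndex': [1, 1, 1, 2, 2],
    'timestamp': [
        '01-03-2020 07:56:00',
        '01-03-2020 07:57:00',
        '01-03-2020 07:59:00',
        '01-03-2020 08:01:00',
        '01-03-2020 08:04:00',
    ],
})

# parse the strings into real datetimes (format is an assumption)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d-%m-%Y %H:%M:%S')

# sequence number restarting at 1 within each group
df['seq_no'] = df.groupby('MyIndex').cumcount() + 1

# per-group difference to the previous row, expressed in whole minutes
step = df.groupby('MyIndex')['timestamp'].diff()
df['timediff'] = step.dt.total_seconds().div(60).fillna(0).astype(int)
```

Working with real datetimes also handles differences that cross midnight or span several days, which the minutes-since-midnight conversion would not.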