Adding a new 'step' value column to timeseries data with multiple records per timestamp in Python/Pandas

Tags: python, python-3.x, pandas, dataframe, time-series

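I am trying to assign a new df column 'step', where the value of each row in df['step'] increments at each unique value of a different column ('time'). The time column is sorted in ascending order, and the order of tag_id does not matter. Each unique timestamp may have a different number of unique tag_id values, but all time values are evenly spaced, at an interval of 00:00:00:05.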

The dataset looks like this: timestamps, with several unique tag_ids at each time, each with an x and y position.

    tag_id      x_pos      y_pos             time  
0        1  77.134000  70.651000         19:03:51 
1        2  66.376432  34.829683         19:03:51     
2        3  49.250835  37.848381         19:03:51     
3        1  50.108018   7.670564  19:03:51.050000     
4        2  54.919299  47.613906  19:03:51.050000     
5        3  57.584265  38.440233  19:03:51.050000     
6        1  47.862124  29.133489  19:03:51.100000     
7        2  71.092900  71.650500  19:03:51.100000     
8        3  65.704667  25.856978  19:03:51.100000     
9        1  62.680708  13.710716  19:03:51.150000     
10       2  65.673670  47.574349  19:03:51.150000     
11       3  77.134000  70.651000  19:03:51.150000     
12       1  66.410406  34.792751  19:03:51.200000     
13       2  49.306861  37.714626  19:03:51.200000     
14       3  50.142578   7.575307  19:03:51.200000     
15       1  54.940298  47.528109  19:03:51.250000     
Using a mask for each unique value in df['time'], I wrote a function that works but is very slow (the original dataset has about 500,000 records and 41,000 unique times).

It gives:

    tag_id      x_pos      y_pos             time  step  
0        1  77.134000  70.651000         19:03:51     0
1        2  66.376432  34.829683         19:03:51     0
2        3  49.250835  37.848381         19:03:51     0
3        1  50.108018   7.670564  19:03:51.050000     1
4        2  54.919299  47.613906  19:03:51.050000     1
5        3  57.584265  38.440233  19:03:51.050000     1
6        1  47.862124  29.133489  19:03:51.100000     2
7        2  71.092900  71.650500  19:03:51.100000     2
8        3  65.704667  25.856978  19:03:51.100000     2
9        1  62.680708  13.710716  19:03:51.150000     3
10       2  65.673670  47.574349  19:03:51.150000     3
11       3  77.134000  70.651000  19:03:51.150000     3
12       1  66.410406  34.792751  19:03:51.200000     4
13       2  49.306861  37.714626  19:03:51.200000     4
14       3  50.142578   7.575307  19:03:51.200000     4
15       1  54.940298  47.528109  19:03:51.250000     5
Is there a more efficient way to achieve this result? Thanks, everyone!

Try this:

import pandas as pd

# Whitespace-separated file; sep=r'\s+' replaces the deprecated delim_whitespace=True
df = pd.read_csv('data.txt', sep=r'\s+', parse_dates=['time'])
# Difference between each timestamp and the previous row's; the first row is NaT
delta = df['time'].diff()
# A positive difference marks the start of a new step. NaT > 0 is False,
# so the first row needs no special handling and starts at step 0.
df['step'] = (delta > pd.Timedelta(0)).cumsum()
print(df)
which produces:

    tag_id      x_pos      y_pos                    time  step
0        1  77.134000  70.651000 2020-02-16 19:03:51.000     0
1        2  66.376432  34.829683 2020-02-16 19:03:51.000     0
2        3  49.250835  37.848381 2020-02-16 19:03:51.000     0
3        1  50.108018   7.670564 2020-02-16 19:03:51.050     1
4        2  54.919299  47.613906 2020-02-16 19:03:51.050     1
5        3  57.584265  38.440233 2020-02-16 19:03:51.050     1
6        1  47.862124  29.133489 2020-02-16 19:03:51.100     2
7        2  71.092900  71.650500 2020-02-16 19:03:51.100     2
8        3  65.704667  25.856978 2020-02-16 19:03:51.100     2
9        1  62.680708  13.710716 2020-02-16 19:03:51.150     3
10       2  65.673670  47.574349 2020-02-16 19:03:51.150     3
11       3  77.134000  70.651000 2020-02-16 19:03:51.150     3
12       1  66.410406  34.792751 2020-02-16 19:03:51.200     4
13       2  49.306861  37.714626 2020-02-16 19:03:51.200     4
14       3  50.142578   7.575307 2020-02-16 19:03:51.200     4
15       1  54.940298  47.528109 2020-02-16 19:03:51.250     5
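
For comparison, here is a minimal sketch of a few other vectorized ways to get the same numbering. It is not part of the answer above. The small DataFrame is a made-up miniature of the question's data, and the column names step_ngroup and step_interval are hypothetical, used only to show the results side by side.

```python
import pandas as pd

# Hypothetical miniature of the question's data: several tag_ids per timestamp.
df = pd.DataFrame({
    "tag_id": [1, 2, 3, 1, 2, 3, 1],
    "time": pd.to_datetime([
        "2020-02-16 19:03:51.000", "2020-02-16 19:03:51.000",
        "2020-02-16 19:03:51.000", "2020-02-16 19:03:51.050",
        "2020-02-16 19:03:51.050", "2020-02-16 19:03:51.050",
        "2020-02-16 19:03:51.100",
    ]),
})

# Option 1: label each distinct timestamp in order of first appearance.
# Since 'time' is already sorted ascending, the labels come out as 0, 1, 2, ...
df["step"] = pd.factorize(df["time"])[0]

# Option 2: number the groups formed by 'time' (groupby sorts the keys by default).
df["step_ngroup"] = df.groupby("time").ngroup()

# Option 3: assumes the regular 50 ms spacing seen in the sample data;
# integer-divide the elapsed time since the first row by the interval.
elapsed = df["time"] - df["time"].iloc[0]
df["step_interval"] = elapsed // pd.Timedelta(milliseconds=50)

print(df)
```

Options 1 and 2 only need the time column to be sorted. Option 3 depends on the spacing being exactly regular, and if a timestamp were missing it would skip a step number rather than count consecutively.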

Please keep to SO's Q&A format; do not post the answer inside the question post.