Adding a new 'step' value column in Python/Pandas for timeseries data with multiple records per timestamp
I'm trying to assign a new df column, "step", where the value of each row in df['step'] increments at each unique value in a different column ("time"). The time column is sorted in ascending order, and the order of tag_id doesn't matter. Each unique timestamp may have a different number of unique tag_id values, but all time values are regularly spaced, at intervals of 00:00:00.05.
The dataset looks like this, with timestamps and, at each time, multiple unique tag ids with x and y positions:
tag_id x_pos y_pos time
0 1 77.134000 70.651000 19:03:51
1 2 66.376432 34.829683 19:03:51
2 3 49.250835 37.848381 19:03:51
3 1 50.108018 7.670564 19:03:51.050000
4 2 54.919299 47.613906 19:03:51.050000
5 3 57.584265 38.440233 19:03:51.050000
6 1 47.862124 29.133489 19:03:51.100000
7 2 71.092900 71.650500 19:03:51.100000
8 3 65.704667 25.856978 19:03:51.100000
9 1 62.680708 13.710716 19:03:51.150000
10 2 65.673670 47.574349 19:03:51.150000
11 3 77.134000 70.651000 19:03:51.150000
12 1 66.410406 34.792751 19:03:51.200000
13 2 49.306861 37.714626 19:03:51.200000
14 3 50.142578 7.575307 19:03:51.200000
15 1 54.940298 47.528109 19:03:51.250000
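Using a mask for each unique value in df['time'], I wrote a function that works but is very slow (the original dataset has about 500,000 records and 41,000 unique times).
It gives: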
tag_id x_pos y_pos time step
0 1 77.134000 70.651000 19:03:51 0
1 2 66.376432 34.829683 19:03:51 0
2 3 49.250835 37.848381 19:03:51 0
3 1 50.108018 7.670564 19:03:51.050000 1
4 2 54.919299 47.613906 19:03:51.050000 1
5 3 57.584265 38.440233 19:03:51.050000 1
6 1 47.862124 29.133489 19:03:51.100000 2
7 2 71.092900 71.650500 19:03:51.100000 2
8 3 65.704667 25.856978 19:03:51.100000 2
9 1 62.680708 13.710716 19:03:51.150000 3
10 2 65.673670 47.574349 19:03:51.150000 3
11 3 77.134000 70.651000 19:03:51.150000 3
12 1 66.410406 34.792751 19:03:51.200000 4
13 2 49.306861 37.714626 19:03:51.200000 4
14 3 50.142578 7.575307 19:03:51.200000 4
15 1 54.940298 47.528109 19:03:51.250000 5
Is there a more efficient way to achieve this result? Thanks, everyone!
Try this:
import pandas as pd
# Read the whitespace-delimited file and parse 'time' as datetimes
df = pd.read_csv('data.txt', sep=r'\s+', parse_dates=['time'])
# Difference between each row's time and the previous row's (NaT for the first row)
delta = df['time'].diff()
# NaT > 0 evaluates to False, so the first row starts at step 0;
# every positive jump in time then adds 1 to the running count
df['step'] = (delta > pd.Timedelta(0)).cumsum()
print(df)
This produces:
tag_id x_pos y_pos time step
0 1 77.134000 70.651000 2020-02-16 19:03:51.000 0
1 2 66.376432 34.829683 2020-02-16 19:03:51.000 0
2 3 49.250835 37.848381 2020-02-16 19:03:51.000 0
3 1 50.108018 7.670564 2020-02-16 19:03:51.050 1
4 2 54.919299 47.613906 2020-02-16 19:03:51.050 1
5 3 57.584265 38.440233 2020-02-16 19:03:51.050 1
6 1 47.862124 29.133489 2020-02-16 19:03:51.100 2
7 2 71.092900 71.650500 2020-02-16 19:03:51.100 2
8 3 65.704667 25.856978 2020-02-16 19:03:51.100 2
9 1 62.680708 13.710716 2020-02-16 19:03:51.150 3
10 2 65.673670 47.574349 2020-02-16 19:03:51.150 3
11 3 77.134000 70.651000 2020-02-16 19:03:51.150 3
12 1 66.410406 34.792751 2020-02-16 19:03:51.200 4
13 2 49.306861 37.714626 2020-02-16 19:03:51.200 4
14 3 50.142578 7.575307 2020-02-16 19:03:51.200 4
15 1 54.940298 47.528109 2020-02-16 19:03:51.250 5
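As a possible alternative, the same numbering can be sketched by labelling each distinct timestamp directly rather than differencing consecutive rows. The small frame below is a stand-in built for illustration only (it is not the asker's data.txt), and the names step_by_appearance and step are assumptions of this sketch:

```python
import pandas as pd

# Illustrative stand-in for the question's data: several tag ids per timestamp.
df = pd.DataFrame({
    "tag_id": [1, 2, 3, 1, 2, 3, 1],
    "time": pd.to_datetime([
        "2020-02-16 19:03:51.000", "2020-02-16 19:03:51.000",
        "2020-02-16 19:03:51.000", "2020-02-16 19:03:51.050",
        "2020-02-16 19:03:51.050", "2020-02-16 19:03:51.050",
        "2020-02-16 19:03:51.100",
    ]),
})

# Option 1: factorize numbers the unique times in order of first appearance,
# which equals the step number when 'time' is already sorted ascending.
step_by_appearance = pd.factorize(df["time"])[0]

# Option 2: ngroup numbers the groups in sorted key order,
# so it does not rely on the rows being pre-sorted by time.
df["step"] = df.groupby("time").ngroup()

print(df)
```

Both options avoid a Python-level loop over the unique times, so they should scale to the roughly 500,000 rows mentioned in the question, though that has not been timed here.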
Please keep to the SO question-and-answer format and do not post the answer inside the question post.