Python: interpolating times without knowing the exact pattern
I have a dataframe (multi-index) that contains a datetime column with missing values (NaT). The times were collected at 0.2-second intervals but saved with one-second resolution. So, if there were no NaT values, the dataframe would contain 5 duplicates per second, e.g.
2020-09-30 14:18:44 #for 14:18:44:00
2020-09-30 14:18:44 #for 14:18:44:20
2020-09-30 14:18:44 #for 14:18:44:40
2020-09-30 14:18:44 #for 14:18:44:60
2020-09-30 14:18:44 #for 14:18:44:80
2020-09-30 14:18:45 #for 14:18:45:00
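To make the duplication concrete, here is a minimal sketch. It assumes the saved value is simply the true time truncated to whole seconds; the names `true_times` and `saved` are illustrative only.

```python
import pandas as pd

# Six consecutive measurements, 0.2 s apart (the true, unsaved times).
true_times = pd.date_range("2020-09-30 14:18:44", periods=6, freq="200ms")

# ASSUMPTION: saving at one-second resolution truncates the fraction.
saved = true_times.floor("s")

# Five rows collapse onto 14:18:44, the sixth starts 14:18:45.
print(pd.DataFrame({"true": true_times, "saved": saved}))
```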
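However, life is not that easy. My dataframe looks like the one below, and I want to interpolate the times in the DT column. The only thing I know is that the measurements within each group (the Gr column) are continuous, 0.2 seconds apart. What I don't know is:
- at which fraction of a second the measurement of a group started
- which fraction of a second was saved (one value may represent 0.2 s, while the next saved output may be 0.6 s)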
The first bullet prevents me from relying on No (where x%5==0 would mean 0.0 s), because x%5==0 could mean 0.0, 0.2, 0.4, 0.6 or 0.8 s.
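The second bullet prevents me from filling the NaT values at fixed intervals starting from a group's first saved time. This is what I tried (the code sample at the end), but it does not work for e.g. group B, because the output looks like this: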
B 18 4 NaT
19 7 NaT
20 11 NaT
21 3 NaT
22 7 NaT
23 5 2020-09-30 14:30:43:00
24 23 2020-09-30 14:30:43:02
25 1 2020-09-30 14:30:43:04
26 9 2020-09-30 14:30:43:06 #missing 0.8 sec. here
27 2 2020-09-30 14:30:44:00
28 4 2020-09-30 14:30:44:02
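The sample data I provide below is small, but I hope it is enough to illustrate my problem. My original data has >100 rows per group (index level = 0), so I am sure it is possible to work out the pattern; I just don't know how.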
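One way the pattern might be worked out, as a sketch rather than a confirmed solution: if each saved stamp is the true time truncated to whole seconds (an assumption, not something the data proves), then every stamped row limits where the group's first row can lie. Counting in 0.2 s "ticks", a row at position i stamped with second s must satisfy 5s <= start + i <= 5s + 4. Intersecting these ranges over all stamped rows narrows the start, and with enough rows it may become unique. The helpers `start_tick_bounds` and `rebuild_times` are hypothetical names.

```python
import numpy as np
import pandas as pd

TICKS_PER_SECOND = 5      # 0.2 s sampling interval, as stated in the question
EPOCH = pd.Timestamp(0)   # 1970-01-01, used to turn timestamps into integers


def start_tick_bounds(dt: pd.Series) -> tuple:
    """Inclusive bounds for the first row's time, in 0.2 s ticks since EPOCH.

    ASSUMPTION: a saved stamp is the true time truncated to whole seconds,
    and samples fall on the .0/.2/.4/.6/.8 grid.
    """
    pos = np.flatnonzero(dt.notna().to_numpy())  # positions of stamped rows
    secs = ((dt.dropna() - EPOCH) // pd.Timedelta(seconds=1)).to_numpy()
    first_tick = secs * TICKS_PER_SECOND         # tick of s.0 for each stamp
    lo = int((first_tick - pos).max())           # row is at or after s.0
    hi = int((first_tick + TICKS_PER_SECOND - 1 - pos).min())  # at or before s.8
    return lo, hi


def rebuild_times(n_rows: int, start_tick: int) -> pd.DatetimeIndex:
    """Evenly spaced times, 0.2 s apart, starting at start_tick."""
    ticks = start_tick + np.arange(n_rows)
    return pd.to_datetime(ticks * 200, unit="ms")


# Toy group (made-up values): stamps at positions 2 and 11 only.
dt = pd.Series(pd.to_datetime(
    [None, None, "2020-09-30 14:30:43", None, None, None,
     None, None, None, None, None, "2020-09-30 14:30:44"]))

lo, hi = start_tick_bounds(dt)
if lo == hi:
    # The constraints pin the start to a single tick.
    times = rebuild_times(len(dt), lo)
    print(times)
```

If `hi < lo`, the stamps contradict the truncation assumption; if `lo < hi`, several starts remain possible and more stamped rows are needed. On real data this would be applied per group, for example inside a `groupby(level=0)` loop.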
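Sample data, where:
- Gr: group, multi-index level = 0
- No: measurement ID; for a given group it may start at e.g. 18 (meaning 0-17 were excluded from further processing), multi-index level = 1
- x: measured value
- DT: measurement date and time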
x DT
Gr No
A 1 2 2020-09-30 14:18:43
2 4 NaT
3 5 NaT
4 2 NaT
5 4 NaT
6 6 2020-09-30 14:18:44
7 9 NaT
8 9 NaT
9 9 NaT
10 9 NaT
11 1 2020-09-30 14:18:45
12 2 NaT
13 6 NaT
14 8 NaT
15 22 NaT
B 18 4 NaT
19 7 NaT
20 11 NaT
21 3 NaT
22 7 NaT
23 5 2020-09-30 14:30:43
24 23 NaT
25 1 NaT
26 9 NaT
27 2 2020-09-30 14:30:44
28 4 NaT
29 3 NaT
30 11 NaT
31 15 NaT
32 20 NaT
C 0 13 NaT
1 6 2020-09-30 14:48:53
2 22 NaT
3 26 NaT
4 2 NaT
5 7 NaT
6 3 2020-09-30 14:48:54
7 6 NaT
8 1 NaT
9 9 NaT
10 2 NaT
11 14 2020-09-30 14:48:55
12 24 NaT
13 20 NaT
14 5 NaT
Sample data:
import pandas as pd

data = {
'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'No': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
'x': [2, 4, 5, 2, 4, 6, 9, 9, 9, 9, 1, 2, 6, 8, 22, 4, 7, 11, 3, 7, 5, 23, 1, 9, 2, 4, 3, 11, 15, 20, 13, 6, 22, 26, 2, 7, 3, 6, 1, 9, 2, 14, 24, 20, 5, 3, 6, 9, 22, 15, 4, 21, 15, 12, 10, 12, 5, 8, 1, 7, 24, 2, 19, 6, 9, 23, 26, 21, 13, 3, 9, 12, 9, 13, 18, 14, 20, 9, 8, 20, 7, 3, 1, 7, 11, 6, 5, 2, 9, 3],
'DT': ['2020-09-30 17:18:43', None, None, None, None, '2020-09-30 17:18:44', None, None, None, None, '2020-09-30 17:18:45', None, None, None, None, '2020-09-30 17:18:46', None, None, None, None, '2020-09-30 17:18:47', None, None, None, None, '2020-09-30 17:18:48', None, None, None, '2020-09-30 17:18:49', None, None, None, None, None, '2020-09-30 17:30:43', None, None, None, '2020-09-30 17:30:44', None, None, None, None, None, '2020-09-30 17:30:45', None, None, None, None, '2020-09-30 17:30:46', None, None, None, None, '2020-09-30 17:30:47', None, None, None, None, None, '2020-09-30 17:48:53', None, None, None, None, '2020-09-30 17:48:54', None, None, None, None, '2020-09-30 17:48:55', None, None, None, None, '2020-09-30 17:48:56', None, None, None, None, '2020-09-30 17:48:57', None, None, None, None, '2020-09-30 17:48:58', None, None, None]
}
df = pd.DataFrame.from_dict(data)
df = df.set_index(keys = ["Group", "No"])
df["DT"] = pd.to_datetime(df["DT"])
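And here is the code I used to compute the time offsets, assuming the first occurrence is the measurement at 0.0 s, which is wrong. The bfill code is analogous.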
mask = df["DT"].notna()  # True where a timestamp was saved
g = df["DT"].groupby([pd.Grouper(level=0), mask.cumsum()])  # one block per group and saved stamp
t = pd.to_timedelta(g.cumcount() * 0.20, unit="s")  # 0.2 s offset per row within a block
df["DT"] = df["DT"].groupby(level=0).ffill() + t  # forward-fill each stamp, then add the offset
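To locate where this fill breaks, the step between consecutive filled times can be checked per group. The sketch below uses a made-up six-row series that mimics the group B case (a stamp, three NaT rows, then the next stamp) and flags every step that is not 0.2 s. The names `filled`, `step` and `bad` are illustrative.

```python
import pandas as pd

# Toy series mimicking group B: a stamp, three NaT rows, then the next stamp
# (hypothetical values, not the real data).
idx = pd.MultiIndex.from_product([["B"], range(23, 29)], names=["Group", "No"])
dt = pd.Series(
    pd.to_datetime(["2020-09-30 14:30:43", None, None, None,
                    "2020-09-30 14:30:44", None]),
    index=idx, name="DT")

# Same fill as in the attempt: forward-fill each stamp and add 0.2 s per row.
mask = dt.notna()
g = dt.groupby([pd.Grouper(level=0), mask.cumsum()])
offset = pd.to_timedelta(g.cumcount() * 200, unit="ms")
filled = dt.groupby(level=0).ffill() + offset

# Difference to the previous row within each group; the first row has none.
step = filled.groupby(level=0).diff()
bad = step[step.notna() & (step != pd.Timedelta(milliseconds=200))]
print(bad)
```

Rows listed in `bad` are the places where a saved stamp did not fall exactly five rows after the previous one, which is where assuming "first occurrence = 0.0 s" stops holding.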