Pythonic way to select the latest values from a time-series DataFrame

python pandas numpy datetime

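I have a time series that contains multiple values at each datetime. Each datetime index has an associated datetime at which the value was loaded, or "loadtime", as shown below: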

import datetime as dt
import numpy as np
import pandas as pd

# time-series index
t = pd.date_range('09/01/2017', '09/02/2017', freq='1H')
t = t.repeat(3)
n = len(t)

# data values
y = np.full((n), 0.0)
y = y.reshape(n//3, 3)
y[:, 1] = 1.0
y[:, 2] = 2.0
y = y.flatten()

# load timestamp
random_range = np.arange(0, 60)
base_date = np.datetime64('2017-10-01 12:00')
loadtimes = [base_date + np.random.choice(random_range) for x in range(n)]

df = pd.DataFrame(index=t, data={'y': y, 'loadtime': loadtimes})


>>> df.head(12)
                    loadtime            y
2017-09-01 00:00:00 2017-10-02 01:59:00 0.0
2017-09-01 00:00:00 2017-10-02 09:23:00 1.0
2017-09-01 00:00:00 2017-10-02 03:35:00 2.0
2017-09-01 01:00:00 2017-10-01 17:26:00 0.0
2017-09-01 01:00:00 2017-10-01 16:44:00 1.0
2017-09-01 01:00:00 2017-10-02 12:50:00 2.0
2017-09-01 02:00:00 2017-10-02 11:30:00 0.0
2017-09-01 02:00:00 2017-10-02 11:17:00 1.0
2017-09-01 02:00:00 2017-10-01 20:23:00 2.0
2017-09-01 03:00:00 2017-10-02 15:27:00 0.0
2017-09-01 03:00:00 2017-10-02 18:08:00 1.0
2017-09-01 03:00:00 2017-10-01 16:06:00 2.0

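So far I have come up with a solution that iterates over all the unique index values, but this could get expensive as the length of the time series (and the number of values per timestamp) grows. It also feels a bit hacky and not very clean: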

new_index = df.index.unique()
df_new = pd.DataFrame(index=new_index, columns=['y'])

# cycle through unique indices to find max loadtime
dfg = df.groupby(df.index)
for i, dfg_i in dfg:
    max_index = dfg_i['loadtime'] == dfg_i['loadtime'].max()

    if i in df_new.index:
        df_new.loc[i, 'y'] = dfg_i.loc[max_index, 'y'].values[0]  # WHY IS THIS A LIST?

>>> df_new.head()
                    y
2017-09-01 00:00:00 1
2017-09-01 01:00:00 2
2017-09-01 02:00:00 0
2017-09-01 03:00:00 1
2017-09-01 04:00:00 1
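
On the "WHY IS THIS A LIST?" comment above: selecting with a boolean mask returns a Series (one element per matching row, and more than one if loadtimes tie), so `.values` is an array rather than a scalar. The following is a minimal sketch on a single hand-built group; the name `g` and the sample values are illustrative assumptions, not part of the original post. It also shows one possible way to get a scalar by position instead.

```python
import pandas as pd

# One group: three rows sharing the same timestamp (illustrative data).
idx = pd.to_datetime(['2017-09-01 00:00'] * 3)
g = pd.DataFrame(
    {
        'y': [0.0, 1.0, 2.0],
        'loadtime': pd.to_datetime(
            ['2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35']
        ),
    },
    index=idx,
)

# A boolean mask keeps every matching row, so the result is a Series.
mask = g['loadtime'] == g['loadtime'].max()
selected = g.loc[mask, 'y']
print(type(selected).__name__, len(selected))

# Positional lookup returns a single scalar instead.
pos = g['loadtime'].to_numpy().argmax()
scalar = g['y'].iloc[pos]
print(scalar)
```

Label-based lookup with `.loc` would not help here, because every row in the group carries the same index label.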

How can I get a time series with the most recent "loadtime" for each unique index? Is there a more Pythonic solution?

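First create a column from the DatetimeIndex with reset_index, then make the y column the index with set_index. Then use groupby with idxmax to return, for each group, the index (here the y column value) of the maximum loadtime: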

print (df.rename_axis('dat')
         .reset_index()
         .set_index('y')
         .groupby('dat')['loadtime']
         .idxmax()
         .to_frame('y'))

                       y
dat                     
2017-09-01 00:00:00  1.0
2017-09-01 01:00:00  2.0
2017-09-01 02:00:00  0.0
2017-09-01 03:00:00  1.0
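
Because the question's data uses random loadtimes, the output above cannot be reproduced exactly. As a rough, self-contained sketch, the same chain can be run on a small fixed frame; the six rows below are copied from the first two timestamps of the sample output in the question, and the variable name `latest` is an assumption for illustration.

```python
import pandas as pd

# Two timestamps, three values each, with fixed loadtimes.
idx = pd.to_datetime(['2017-09-01 00:00'] * 3 + ['2017-09-01 01:00'] * 3)
df = pd.DataFrame(
    {
        'y': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
        'loadtime': pd.to_datetime([
            '2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35',
            '2017-10-01 17:26', '2017-10-01 16:44', '2017-10-02 12:50',
        ]),
    },
    index=idx,
)

latest = (df.rename_axis('dat')          # name the index so it becomes column 'dat'
            .reset_index()               # move the timestamps into a column
            .set_index('y')              # y values become the index labels
            .groupby('dat')['loadtime']
            .idxmax()                    # label (y value) of the latest loadtime per group
            .to_frame('y'))
print(latest)
```

The trick is that idxmax returns an index label, so putting y in the index makes the label itself the wanted value.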

Details:

print (df.rename_axis('dat')
         .reset_index()
         .set_index('y'))

                    dat            loadtime
y                                          
0.0 2017-09-01 00:00:00 2017-10-02 01:59:00
1.0 2017-09-01 00:00:00 2017-10-02 09:23:00
2.0 2017-09-01 00:00:00 2017-10-02 03:35:00
0.0 2017-09-01 01:00:00 2017-10-01 17:26:00
1.0 2017-09-01 01:00:00 2017-10-01 16:44:00
2.0 2017-09-01 01:00:00 2017-10-02 12:50:00
0.0 2017-09-01 02:00:00 2017-10-02 11:30:00
1.0 2017-09-01 02:00:00 2017-10-02 11:17:00
2.0 2017-09-01 02:00:00 2017-10-01 20:23:00
0.0 2017-09-01 03:00:00 2017-10-02 15:27:00
1.0 2017-09-01 03:00:00 2017-10-02 18:08:00
2.0 2017-09-01 03:00:00 2017-10-01 16:06:00

Timings:

t = pd.date_range('01/01/2017', '12/25/2017', freq='1H')
#len(df)
#[25779 rows x 2 columns]

In [225]: %timeit (df.rename_axis('dat').reset_index().set_index('y').groupby('dat')['loadtime'].idxmax().to_frame('y'))
1 loop, best of 3: 870 ms per loop

In [226]: %timeit df.groupby(level=0).apply(lambda x : x.set_index('y').idxmax()).rename(columns={'loadtime':'y'})
1 loop, best of 3: 4.96 s per loop
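
A possible alternative that neither answer in this thread uses is to sort by loadtime and take the last y in each index group. This is only a sketch on small fixed data; it was not part of the timings above, so no speed claim is made for it, and the name `latest_y` is illustrative.

```python
import pandas as pd

# Same shape as the question's data: repeated timestamps, fixed loadtimes.
idx = pd.to_datetime(['2017-09-01 00:00'] * 3 + ['2017-09-01 01:00'] * 3)
df = pd.DataFrame(
    {
        'y': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
        'loadtime': pd.to_datetime([
            '2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35',
            '2017-10-01 17:26', '2017-10-01 16:44', '2017-10-02 12:50',
        ]),
    },
    index=idx,
)

# After sorting by loadtime, the last row of each index group is the most
# recently loaded one; groupby sorts the group keys, so the result is in
# timestamp order.
latest_y = df.sort_values('loadtime').groupby(level=0)['y'].last()
print(latest_y)
```

Note that `last()` skips missing values, so if y can be NaN this returns the latest non-null value rather than the value on the latest row.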

You can use groupby on level 0 together with apply, i.e.

ndf = df.groupby(level=0).apply(lambda x : x.set_index('y').idxmax()).rename(columns={'loadtime':'y'})

Output:

ndf.head()

                       y
2017-09-01 00:00:00  1.0
2017-09-01 01:00:00  1.0
2017-09-01 02:00:00  2.0
2017-09-01 03:00:00  1.0
2017-09-01 04:00:00  1.0

I think I spent a lot of time on this, it was a really tricky question. You answered it 18 minutes ago? Awesome.

I tested it, and your solution is slow :(

Yes sir, can't disagree. No worries, your approach is great. You made nice use of reset_index.