Regularizing an irregular time series with linear interpolation in Python

python pandas

I have a pandas time series that looks like this:

                     Values
1992-08-27 07:46:48    28.0  
1992-08-27 08:00:48    28.2  
1992-08-27 08:33:48    28.4  
1992-08-27 08:43:48    28.8  
1992-08-27 08:48:48    29.0  
1992-08-27 08:51:48    29.2  
1992-08-27 08:53:48    29.6  
1992-08-27 08:56:48    29.8  
1992-08-27 09:03:48    30.0

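I would like to resample it to a regular time series with 15-minute time steps, where the values are linearly interpolated. Basically, I would like to get: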

                     Values
1992-08-27 08:00:00    28.2  
1992-08-27 08:15:00    28.3  
1992-08-27 08:30:00    28.4  
1992-08-27 08:45:00    28.8  
1992-08-27 09:00:00    29.9

However, using the pandas resample method (df.resample('15Min')) I get:

                     Values
1992-08-27 08:00:00   28.20  
1992-08-27 08:15:00     NaN  
1992-08-27 08:30:00   28.60  
1992-08-27 08:45:00   29.40  
1992-08-27 09:00:00   30.00

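I have tried the resample method with different 'how' and 'fill_method' parameters, but never got the results I wanted. Am I using the wrong method?

I figure this is a fairly simple query, but I have searched the web for a while and could not find an answer.

Thanks in advance for any help.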

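It takes a bit of work, but try this out. The basic idea is to find the two timestamps closest to each resample point and interpolate between them. np.searchsorted is used to find the dates closest to the resample point.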

# empty frame with desired index
# (.mean() materializes the resampled frame; iloc[1:] drops the 07:45 bin)
rs = pd.DataFrame(index=df.resample('15min').mean().iloc[1:].index)

# array of indexes corresponding with closest timestamp after resample
idx_after = np.searchsorted(df.index.values, rs.index.values)

# values and timestamp before/after resample
rs['after'] = df.loc[df.index[idx_after], 'Values'].values
rs['before'] = df.loc[df.index[idx_after - 1], 'Values'].values
rs['after_time'] = df.index[idx_after]
rs['before_time'] = df.index[idx_after - 1]

#calculate new weighted value
rs['span'] = (rs['after_time'] - rs['before_time'])
rs['after_weight'] = (rs['after_time'] - rs.index) / rs['span']
# I got errors here unless I turn the index to a series
rs['before_weight'] = (pd.Series(data=rs.index, index=rs.index) - rs['before_time']) / rs['span']

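# weight each value by the distance to the *other* timestamp
# (weights corrected as pointed out in the comments below)
rs['Values'] = rs.eval('before * after_weight + after * before_weight')

After all that, hopefully the right answer:

In [161]: rs['Values']
Out[161]: 
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429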
Freq: 15T, Name: Values, dtype: float64
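
The before/after weighting above is ordinary piecewise-linear interpolation, so it can also be sketched more compactly with np.interp. This is an alternative formulation, not the answer's own code; the names grid, x_old, x_new and regular are illustrative, and the timestamps are converted to seconds relative to the first grid point so that np.interp sees plain numbers.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {"Values": [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0]},
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
)

# Regular 15-minute grid to interpolate onto
grid = pd.date_range("1992-08-27 08:00:00", "1992-08-27 09:00:00", freq="15min")

# np.interp works on numbers, so express both time axes as seconds
# elapsed since the first grid point (negative for earlier samples)
t0 = grid[0]
x_old = (df.index - t0).total_seconds().to_numpy()
x_new = (grid - t0).total_seconds().to_numpy()

# Piecewise-linear interpolation of the irregular samples onto the grid
regular = pd.Series(
    np.interp(x_new, x_old, df["Values"].to_numpy()),
    index=grid,
    name="Values",
)
print(regular)
```

np.interp requires the sample positions to be increasing, which holds here because the original index is sorted in time.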

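You can do this with traces. First, create a TimeSeries with the irregular measurements, just as you would create a dictionary:

from datetime import datetime, timedelta
import traces

ts = traces.TimeSeries([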
    (datetime(1992, 8, 27, 7, 46, 48), 28.0),
    (datetime(1992, 8, 27, 8, 0, 48), 28.2),
    ...
    (datetime(1992, 8, 27, 9, 3, 48), 30.0),
])

Then use the sample method to regularize:

ts.sample(
    sampling_period=timedelta(minutes=15),
    start=datetime(1992, 8, 27, 8),
    end=datetime(1992, 8, 27, 9),
    interpolate='linear',
)

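This results in the following regularized version, where the gray dots are the original data and the orange is the regularized version with linear interpolation.

The interpolated values are: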

1992-08-27 08:00:00    28.189 
1992-08-27 08:15:00    28.286  
1992-08-27 08:30:00    28.377
1992-08-27 08:45:00    28.848
1992-08-27 09:00:00    29.891

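The same result that @mstringer obtained can be achieved purely in pandas. The trick is to first resample by second, using interpolation to fill in the intermediate values (.resample('s').interpolate()), and then downsample to 15-minute periods (.resample('15T').asfreq()).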
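
A minimal sketch of that chain, assuming the question's data is held in a Series named s (the names s, per_second and res are illustrative and not taken from the original answer). It uses '15min', the newer spelling of the '15T' alias, and drops the leading 07:45 slot, which falls before the first measurement and therefore has no value.

```python
import pandas as pd

# Sample data from the question, as a Series indexed by timestamp
s = pd.Series(
    [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0],
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
    name="Values",
)

# Step 1: upsample to one point per second; the new points are NaN and
# get filled by linear interpolation between the original measurements
per_second = s.resample("s").interpolate()

# Step 2: keep only the values that fall exactly on the 15-minute grid,
# dropping the 07:45 slot that precedes the first measurement
res = per_second.resample("15min").asfreq().dropna()
print(res)
```

Because every original timestamp falls on a whole second, the per-second grid contains all of them, so interpolating by position on that grid is equivalent to interpolating in time.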

Output:

1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, Name: Values, dtype: float64

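I recently had to resample acceleration data that was non-uniformly sampled. It was generally sampled at the correct frequency, but had delays that accumulated intermittently.

I found this question and combined the answers from mstringer and Alberto Garcia-Rabosco, using pure pandas and numpy. This method creates a new index at the desired frequency and then interpolates, without the intermediate step of interpolating at a higher frequency.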

# from Alberto Garcia-Rabosco above
import io
import pandas as pd

data = io.StringIO('''\
Values
1992-08-27 07:46:48,28.0  
1992-08-27 08:00:48,28.2  
1992-08-27 08:33:48,28.4  
1992-08-27 08:43:48,28.8  
1992-08-27 08:48:48,29.0  
1992-08-27 08:51:48,29.2  
1992-08-27 08:53:48,29.6  
1992-08-27 08:56:48,29.8  
1992-08-27 09:03:48,30.0
''')
# read_csv's squeeze= keyword was removed in pandas 2.0; squeeze the frame instead
s = pd.read_csv(data).squeeze('columns')
s.index = pd.to_datetime(s.index)

Code to perform the interpolation:

import numpy as np
# create the new index and a new series full of NaNs
# (pd.DatetimeIndex no longer accepts start/freq/periods; use pd.date_range)
new_index = pd.date_range(start='1992-08-27 08:00:00',
    freq='15min', periods=5)
new_series = pd.Series(np.nan, index=new_index)

# concat the old and new series and remove duplicates (if any) 
comb_series = pd.concat([s, new_series])
comb_series = comb_series[~comb_series.index.duplicated(keep='first')]

# interpolate to fill the NaNs
comb_series.interpolate(method='time', inplace=True)

Output:

>>> print(comb_series[new_index])
1992-08-27 08:00:00    28.188571
1992-08-27 08:15:00    28.286061
1992-08-27 08:30:00    28.376970
1992-08-27 08:45:00    28.848000
1992-08-27 09:00:00    29.891429
Freq: 15T, dtype: float64

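As before, you can use any interpolation method that scipy supports, and this technique works with DataFrames as well (that is what I originally used it for). Finally, note that interpolate defaults to the 'linear' method, which ignores the time information in the index and will not work with non-uniformly spaced data.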
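
To illustrate both remarks, here is a hedged sketch of the same technique applied to a DataFrame, comparing method='time' with the default method='linear'. It uses reindex on the union of the two indexes instead of concat, which is a variation on the code above rather than the answer's own code; the names combined, by_time and by_position are illustrative.

```python
import pandas as pd

# Sample data from the question, as a one-column DataFrame
df = pd.DataFrame(
    {"Values": [28.0, 28.2, 28.4, 28.8, 29.0, 29.2, 29.6, 29.8, 30.0]},
    index=pd.to_datetime([
        "1992-08-27 07:46:48", "1992-08-27 08:00:48", "1992-08-27 08:33:48",
        "1992-08-27 08:43:48", "1992-08-27 08:48:48", "1992-08-27 08:51:48",
        "1992-08-27 08:53:48", "1992-08-27 08:56:48", "1992-08-27 09:03:48",
    ]),
)
new_index = pd.date_range("1992-08-27 08:00:00", periods=5, freq="15min")

# Insert the target timestamps as NaN rows; union() keeps the index sorted
# and free of duplicates
combined = df.reindex(df.index.union(new_index))

# 'time' weights neighbours by their distance in time ...
by_time = combined.interpolate(method="time").loc[new_index]
# ... while 'linear' treats all rows as equally spaced
by_position = combined.interpolate(method="linear").loc[new_index]

print(by_time.join(by_position, lsuffix="_time", rsuffix="_linear"))
```

At 08:00:00 the time-aware result is about 28.1886, while the positional result is 28.1, the plain midpoint of the neighbouring rows 28.0 and 28.2, even though 08:00:00 lies only 48 seconds before the second measurement.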

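Great! I just changed the last line to: rs['Values'] = rs.eval('after * before_weight + before * after_weight'), and now it does the linear interpolation I was hoping for. Thanks.

Thanks, I used - old school :)

@mstringer this method is a godsend! Thank you for sharing it with us.

@mstringer thanks for traces! This approach is really useful for unevenly spaced time series. Could you tell me how you made the chart above? Did you use xmgrace? Do you know of any libraries that could help me recreate the above in Python?

Inefficient, but still clever and useful.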