Python: finding the length of the longest discontinuity
python, numpy, pandas, time-series
I am migrating some code from Python list primitives to a pandas implementation. For some time series, I want to find all the discontinuous segments and their durations. Is there a clean way to do this in pandas?
My data looks like this:
In [23]: df
Out[23]:
2016-07-01 05:35:00 60.466667
2016-07-01 05:40:00 NaN
2016-07-01 05:45:00 NaN
2016-07-01 05:50:00 NaN
2016-07-01 05:55:00 NaN
2016-07-01 06:00:00 NaN
2016-07-01 06:05:00 NaN
2016-07-01 06:10:00 NaN
2016-07-01 06:15:00 NaN
2016-07-01 06:20:00 NaN
2016-07-01 06:25:00 NaN
2016-07-01 06:30:00 NaN
2016-07-01 06:35:00 NaN
2016-07-01 06:40:00 NaN
2016-07-01 06:45:00 NaN
2016-07-01 06:50:00 NaN
2016-07-01 06:55:00 NaN
2016-07-01 07:00:00 NaN
2016-07-01 07:05:00 NaN
2016-07-01 07:10:00 NaN
2016-07-01 07:15:00 NaN
2016-07-01 07:20:00 NaN
2016-07-01 07:25:00 NaN
2016-07-01 07:30:00 NaN
2016-07-01 07:35:00 NaN
2016-07-01 07:40:00 NaN
2016-07-01 07:45:00 63.500000
2016-07-01 07:50:00 67.293333
2016-07-01 07:55:00 67.633333
2016-07-01 08:00:00 68.306667
...
2016-07-01 11:20:00 NaN
2016-07-01 11:25:00 NaN
2016-07-01 11:30:00 62.000000
2016-07-01 11:35:00 69.513333
2016-07-01 11:40:00 64.931298
2016-07-01 11:45:00 51.980000
2016-07-01 11:50:00 55.253333
2016-07-01 11:55:00 51.273333
2016-07-01 12:00:00 52.080000
2016-07-01 12:05:00 54.580000
2016-07-01 12:10:00 55.306667
2016-07-01 12:15:00 55.200000
2016-07-01 12:20:00 57.140000
2016-07-01 12:25:00 57.020000
2016-07-01 12:30:00 57.526667
2016-07-01 12:35:00 57.880000
2016-07-01 12:40:00 67.286667
2016-07-01 12:45:00 58.153333
2016-07-01 12:50:00 57.460000
2016-07-01 12:55:00 54.413333
2016-07-01 13:00:00 55.526667
2016-07-01 13:05:00 56.120000
2016-07-01 13:10:00 55.620000
2016-07-01 13:15:00 56.420000
2016-07-01 13:20:00 51.893333
2016-07-01 13:25:00 74.451613
2016-07-01 13:30:00 54.898551
2016-07-01 13:35:00 NaN
2016-07-01 13:40:00 63.355140
2016-07-01 13:45:00 61.000000
Freq: 5T, dtype: float64
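Here, for example, the first discontinuity runs from 5:40 to 7:40.
This should work as long as you have a Series or a single-column DataFrame. Dropping the NaN rows first means that each difference between consecutive remaining timestamps spans one gap:
>>> pd.Series(df.dropna().index).diff()
This can be improved as follows to get more useful output:
MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
# Time elapsed between consecutive valid (non-NaN) samples
discontinuities = pd.Series(df.dropna().index).diff()
# Largest gaps first
discontinuities = discontinuities.sort_values(ascending=False)
# Number of gaps longer than the threshold
discontinuities[discontinuities > MIN_GAP_TIMEDELTA].size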
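Since the question asks for each discontinuity and its duration rather than only a count, here is a minimal sketch of one way to list them with pandas. It is not the answerer's code: the sample series `s`, the helper names `is_gap` and `run_id`, and the output column names are assumptions made for illustration. Consecutive NaN rows get a shared run id and are then aggregated per run.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a 5-minute series containing two NaN runs
idx = pd.date_range("2016-07-01 05:35", periods=10, freq="5min")
s = pd.Series(
    [60.5, np.nan, np.nan, np.nan, 63.5, 67.3, np.nan, 62.0, 69.5, 64.9],
    index=idx,
)

is_gap = s.isna()
# The run id increases every time the NaN / non-NaN state flips
run_id = is_gap.ne(is_gap.shift(fill_value=False)).cumsum()

# Timestamps of the NaN rows only, grouped by the run they belong to
gap_times = s.index.to_series()[is_gap]
gaps = gap_times.groupby(run_id[is_gap]).agg(
    start="min", end="max", n_samples="count"
)
# Duration measured from the first to the last missing timestamp of each run
gaps["duration"] = gaps["end"] - gaps["start"]

print(gaps)
print("longest gap:", gaps["duration"].max())
```

Measuring from the first to the last missing timestamp matches how the question describes a gap ("from 5:40 to 7:40"). If the gap should instead be measured between the surrounding valid samples, add two sampling steps to each duration.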
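This is not as elegant or concise as the pandas-based solutions, but if performance matters, consider using NumPy arrays and functions. Assuming the datetimes have a fixed frequency, here is a NumPy-based approach that gets the discontinuity lengths, the maximum length, and the thresholded count: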
# Get indices of start and stop indices of discontinuities signified by NaNs
idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
# Do differentiation on those indices which would give us the length of
# intervals of discontinuities. These could be used in various ways.
discontinuity_lens = np.diff(idx.reshape(-1,2),axis=1)
# Max discontinuity length
discontinuity_maxlen = discontinuity_lens.max()
# Count of discontinuities that are greater than a threshold of 30 mins as
# listed with threshold parameter : MIN_GAP_TIMEDELTA = Timedelta(minutes=30)
# (in terms of steps that would be 6 because freq of input dataframe is 5 mins)
thresholded_count = (discontinuity_lens>=6).sum()
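To make the index arithmetic above easier to follow, here is the same edge-detection idea applied to a tiny hand-made array. The array `values` and the names `mask`, `padded`, `edges`, and `runs` are illustrative assumptions; only the NumPy calls mirror the snippet above.

```python
import numpy as np

# Hypothetical data: NaN runs of length 2, 1 and 3
values = np.array([1.0, np.nan, np.nan, 2.0, np.nan, 3.0, np.nan, np.nan, np.nan])

mask = np.isnan(values)
# Pad with False on both sides so runs touching either end still get two edges
padded = np.hstack(([False], mask, [False]))
# For boolean input, np.diff marks the positions where the state flips
edges = np.where(np.diff(padded))[0]
# Each row is one run: start index (inclusive), stop index (exclusive)
runs = edges.reshape(-1, 2)
lengths = runs[:, 1] - runs[:, 0]

print(runs.tolist())     # [[1, 3], [4, 5], [6, 9]]
print(lengths.tolist())  # [2, 1, 3]
print(lengths.max())     # 3
```

Because the series has a fixed 5-minute frequency, multiplying `lengths` by the sampling step converts run lengths into durations, and `runs[:, 0]` can be used to look up the starting timestamp of each run in the original index.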
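Note that this is largely based on another answer.
Runtime test
Here are timings for the NumPy-based approach, compared with the pandas-based one, on a reasonably large DataFrame filled with random values and with 50% NaNs placed at random positions.
Function definitions: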
def thresholdedcount_pandas(df):
    MIN_GAP_TIMEDELTA = pd.Timedelta(minutes=30)
    discontinuities = df.dropna().reset_index()['index'].diff()
    return (discontinuities > MIN_GAP_TIMEDELTA).sum()

def thresholdedcount_numpy(df):
    idx = np.where(np.diff(np.hstack(([False],np.isnan(df[0]),[False]))))[0]
    nan_interval_lens = np.diff(idx.reshape(-1,2),axis=1)
    return (nan_interval_lens>=6).sum()
Timings:
In [325]: # Random dataframe with 5 min interval data and filled with 50% NaNs
...: rng = pd.date_range('1/1/2011', periods=10000, freq='5Min')
...: df = pd.DataFrame(np.random.randn(len(rng)), index=rng)
...: df.iloc[np.random.randint(0, df.shape[0], int(df.shape[0] / 2)), 0] = np.nan
...:
In [326]: np.allclose(thresholdedcount_pandas(df),thresholdedcount_numpy(df))
Out[326]: True
In [327]: %timeit thresholdedcount_pandas(df)
100 loops, best of 3: 3 ms per loop
In [328]: %timeit thresholdedcount_numpy(df)
1000 loops, best of 3: 318 µs per loop
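One point worth checking is why `> 30 minutes` in the pandas version corresponds to `>= 6` steps in the NumPy version, and where the two counts can differ. The sketch below uses hypothetical helpers (`count_pandas`, `count_numpy`) that restate the two functions for a Series, plus three hand-made series. A run of k NaNs between two valid samples puts those samples (k + 1) steps of 5 minutes apart, so the thresholds agree for interior gaps. A NaN run at the very start or end is seen only by the NumPy version, because after `dropna()` there is no neighbouring timestamp on that side to difference against.

```python
import numpy as np
import pandas as pd

def count_pandas(s, min_gap=pd.Timedelta(minutes=30)):
    # Gaps between consecutive valid timestamps
    gaps = s.dropna().index.to_series().diff()
    return int((gaps > min_gap).sum())

def count_numpy(s, min_steps=6):
    # Lengths of NaN runs, in number of samples
    mask = np.isnan(s.to_numpy())
    edges = np.where(np.diff(np.hstack(([False], mask, [False]))))[0]
    lens = np.diff(edges.reshape(-1, 2), axis=1)
    return int((lens >= min_steps).sum())

idx = pd.date_range("2016-07-01", periods=20, freq="5min")

# 6 NaNs in the middle: neighbours are 35 minutes apart
interior = pd.Series(1.0, index=idx)
interior.iloc[5:11] = np.nan

# 5 NaNs in the middle: neighbours are exactly 30 minutes apart
short = pd.Series(1.0, index=idx)
short.iloc[5:10] = np.nan

# 6 NaNs at the very start: no valid sample precedes the run
leading = pd.Series(1.0, index=idx)
leading.iloc[0:6] = np.nan

print(count_pandas(interior), count_numpy(interior))  # 1 1
print(count_pandas(short), count_numpy(short))        # 0 0
print(count_pandas(leading), count_numpy(leading))    # 0 1
```

On random data long NaN runs at the edges are rare, so the two functions usually agree, as in the check above. If edge gaps matter, decide explicitly whether they should count before choosing between the two.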
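This looks like a Series, not a DataFrame.
We will look into this solution. For now I am leaning toward a simpler one, since performance is not a big concern here: this mostly runs in background jobs.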