Python: select only the date ranges in pandas that have data for every consecutive minute


I'm trying to process some data in pandas that looks like this in the CSV:

2014.01.02,08:56,1.37549,1.37552,1.37549,1.37552,3
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
2014.01.02,09:02,1.37546,1.37550,1.37546,1.37546,15
2014.01.02,09:03,1.37546,1.37563,1.37546,1.37559,39
2014.01.02,09:04,1.37559,1.37562,1.37555,1.37561,37
2014.01.02,09:05,1.37561,1.37564,1.37558,1.37561,35
2014.01.02,09:06,1.37561,1.37566,1.37558,1.37563,38
2014.01.02,09:07,1.37563,1.37567,1.37561,1.37566,42
2014.01.02,09:08,1.37570,1.37571,1.37564,1.37566,25
I import it like this:

raw_data = pd.read_csv('raw_data.csv', engine='c', header=None, index_col=0, names=['date', 'time', 'open', 'high', 'low', 'close', 'volume'], parse_dates=[[0,1]])
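For readers on a recent pandas, where the nested-list form of `parse_dates` is deprecated, the same load can be sketched by combining the two columns explicitly. This is only an illustration: the inline `csv_text` stands in for `raw_data.csv`, and the intermediate name `raw` is an assumption, not part of the original question.

```python
import io
import pandas as pd

# A few rows from the question, inlined so the sketch is self-contained.
csv_text = """2014.01.02,08:56,1.37549,1.37552,1.37549,1.37552,3
2014.01.02,09:00,1.37562,1.37562,1.37545,1.37545,21
2014.01.02,09:01,1.37545,1.37550,1.37542,1.37546,18
"""

cols = ['date', 'time', 'open', 'high', 'low', 'close', 'volume']

# Read date and time as plain strings, then build one datetime column.
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=cols,
                  dtype={'date': str, 'time': str})
raw['date_time'] = pd.to_datetime(raw['date'] + ' ' + raw['time'],
                                  format='%Y.%m.%d %H:%M')

# Index by the combined timestamp, as in the original import.
raw_data = raw.drop(columns=['date', 'time']).set_index('date_time')
print(raw_data)
```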
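Now I want to draw some random (or even sequential) samples from the data, but only samples for which I have 5 consecutive minutes of data. For example, the row at 2014.01.02,08:56 can't be used because it is followed by a gap, whereas the row at 2014.01.02,09:00 is fine because it has continuous data for the next 5 minutes.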


Any suggestions on how to do this efficiently?

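Here is one way: first use .asfreq('T') to reindex to a one-minute frequency, which fills the missing minutes with NaNs, then use rolling_apply to check whether the previous or the next 5 observations contain no NaNs.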

# populate NaNs at minutely freq
# ======================
df = raw_data.asfreq('T')
print(df)

                       open    high     low   close  volume
date_time                                                  
2014-01-02 08:56:00  1.3755  1.3755  1.3755  1.3755       3
2014-01-02 08:57:00     NaN     NaN     NaN     NaN     NaN
2014-01-02 08:58:00     NaN     NaN     NaN     NaN     NaN
2014-01-02 08:59:00     NaN     NaN     NaN     NaN     NaN
2014-01-02 09:00:00  1.3756  1.3756  1.3755  1.3755      21
2014-01-02 09:01:00  1.3755  1.3755  1.3754  1.3755      18
2014-01-02 09:02:00  1.3755  1.3755  1.3755  1.3755      15
2014-01-02 09:03:00  1.3755  1.3756  1.3755  1.3756      39
2014-01-02 09:04:00  1.3756  1.3756  1.3756  1.3756      37
2014-01-02 09:05:00  1.3756  1.3756  1.3756  1.3756      35
2014-01-02 09:06:00  1.3756  1.3757  1.3756  1.3756      38
2014-01-02 09:07:00  1.3756  1.3757  1.3756  1.3757      42
2014-01-02 09:08:00  1.3757  1.3757  1.3756  1.3757      25

consecutive_previous_5min = pd.rolling_apply(df['open'], 5, lambda g: np.isnan(g).any()) == 0
consecutive_previous_5min

date_time
2014-01-02 08:56:00    False
2014-01-02 08:57:00    False
2014-01-02 08:58:00    False
2014-01-02 08:59:00    False
2014-01-02 09:00:00    False
2014-01-02 09:01:00    False
2014-01-02 09:02:00    False
2014-01-02 09:03:00    False
2014-01-02 09:04:00     True
2014-01-02 09:05:00     True
2014-01-02 09:06:00     True
2014-01-02 09:07:00     True
2014-01-02 09:08:00     True
Freq: T, dtype: bool

# use the reverse trick to get the next 5 values
consecutive_next_5min = (pd.rolling_apply(df['open'][::-1], 5, lambda g: np.isnan(g).any()) == 0)[::-1]
consecutive_next_5min

date_time
2014-01-02 08:56:00    False
2014-01-02 08:57:00    False
2014-01-02 08:58:00    False
2014-01-02 08:59:00    False
2014-01-02 09:00:00     True
2014-01-02 09:01:00     True
2014-01-02 09:02:00     True
2014-01-02 09:03:00     True
2014-01-02 09:04:00     True
2014-01-02 09:05:00    False
2014-01-02 09:06:00    False
2014-01-02 09:07:00    False
2014-01-02 09:08:00    False
Freq: T, dtype: bool

# keep rows where either the previous 5 or the next 5 observations are all non-null
df.loc[consecutive_next_5min | consecutive_previous_5min]

                       open    high     low   close  volume
date_time                                                  
2014-01-02 09:00:00  1.3756  1.3756  1.3755  1.3755      21
2014-01-02 09:01:00  1.3755  1.3755  1.3754  1.3755      18
2014-01-02 09:02:00  1.3755  1.3755  1.3755  1.3755      15
2014-01-02 09:03:00  1.3755  1.3756  1.3755  1.3756      39
2014-01-02 09:04:00  1.3756  1.3756  1.3756  1.3756      37
2014-01-02 09:05:00  1.3756  1.3756  1.3756  1.3756      35
2014-01-02 09:06:00  1.3756  1.3757  1.3756  1.3756      38
2014-01-02 09:07:00  1.3756  1.3757  1.3756  1.3757      42
2014-01-02 09:08:00  1.3757  1.3757  1.3756  1.3757      25
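Note that `pd.rolling_apply` was deprecated in pandas 0.18 and is no longer available in later releases. A minimal sketch of the same idea using the `.rolling()` method is shown below. It rebuilds only the `open` column from the question so that it runs on its own; the names `has_data`, `prev5`, and `next5` are illustrative, and the `'min'` alias is used in place of `'T'`.

```python
import pandas as pd

# Open prices from the question: one bar at 08:56, then 09:00 through 09:08.
idx = pd.to_datetime(['2014-01-02 08:56'] +
                     [f'2014-01-02 09:0{m}' for m in range(9)])
opens = [1.37549, 1.37562, 1.37545, 1.37546, 1.37546,
         1.37559, 1.37561, 1.37561, 1.37563, 1.37570]
raw_data = pd.DataFrame({'open': opens}, index=idx)

# Reindex to one-minute frequency; missing minutes become NaN.
df = raw_data.asfreq('min')

# 1 where a bar exists, 0 where it is missing.
has_data = df['open'].notna().astype(int)

# A window of 5 is complete only if its minimum is 1 (no missing bar).
prev5 = has_data.rolling(5).min() == 1
# Reverse, roll, and reverse back to look at the next 5 bars instead.
next5 = (has_data[::-1].rolling(5).min() == 1)[::-1]

result = df.loc[prev5 | next5]
print(result)
```

Taking the rolling minimum of a 0/1 indicator avoids calling a Python lambda for every window, which should help on large minute-bar files.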

Thanks, this works well for showing only the data I want, but how do I then select the 5-minute intervals?
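One possible way to pull out an actual 5-minute interval, sketched here under the assumption that a "next 5 bars are present" mask like the one in the answer is available: treat every timestamp where the mask is True as a valid start, sample one, and slice 5 rows from it. The names `starts`, `start`, and `window`, and the fixed `random_state`, are illustrative choices rather than part of the original answer.

```python
import pandas as pd

# Same open prices as in the question, with the gap after 08:56.
idx = pd.to_datetime(['2014-01-02 08:56'] +
                     [f'2014-01-02 09:0{m}' for m in range(9)])
opens = [1.37549, 1.37562, 1.37545, 1.37546, 1.37546,
         1.37559, 1.37561, 1.37561, 1.37563, 1.37570]
df = pd.DataFrame({'open': opens}, index=idx).asfreq('min')

# True where this bar and the following 4 bars all exist.
has_data = df['open'].notna().astype(int)
next5 = (has_data[::-1].rolling(5).min() == 1)[::-1]

# Every True position is a valid start of a complete 5-minute window.
starts = df.index[next5.to_numpy()]

# Pick one start at random (seeded so the sketch is repeatable).
start = pd.Series(starts).sample(n=1, random_state=0).iloc[0]

# Label slicing with .loc is inclusive on both ends, so +4 minutes gives 5 rows.
window = df.loc[start:start + pd.Timedelta(minutes=4)]
print(window)
```

For sequential rather than random samples, iterating over `starts` in order and slicing the same way should give every complete window; consecutive windows overlap unless some starts are skipped.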