Extracting business days from a time series with Python/Pandas


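I am working with high-frequency time series data and I would like to extract all business days from it. My observations are one second apart, so there are 86400 of them per day, and my dataset spans 31 days, which gives 2678400 observations.

Here is part of my data: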

In[1]: ts
Out[1]: 
2013-01-01 00:00:00    0.480928
2013-01-01 00:00:01    0.480928
2013-01-01 00:00:02    0.483977
2013-01-01 00:00:03    0.486725
2013-01-01 00:00:04    0.486725
...
2013-01-31 23:59:56    0.451630
2013-01-31 23:59:57    0.451630
2013-01-31 23:59:58    0.451630
2013-01-31 23:59:59    0.454683
Freq: S, Length: 2678400
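
For readers who want to reproduce the setup, a series of the same shape could be built roughly as follows. This is only a sketch: the values are random placeholders rather than the original data, and the name `rng` is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Per-second index covering all of January 2013 (31 days * 86400 seconds).
index = pd.date_range("2013-01-01 00:00:00", "2013-01-31 23:59:59", freq="s")

# Placeholder values: random numbers, NOT the original data.
rng = np.random.default_rng(0)
ts = pd.Series(rng.random(len(index)), index=index)

print(len(ts))  # 2678400
```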

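What I want to do is create a new time series that contains only the business days of the month, while keeping all of their per-second observations. For example, if Wed 2013-01-02 through Fri 2013-01-04 are the first business days of the first week of January, then: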

2013-01-02 00:00:00    0.507477
2013-01-02 00:00:01    0.501373
...
2013-01-03 00:00:00    0.489778
2013-01-03 00:00:01    0.489778
...
2013-01-04 23:59:58    0.598115
2013-01-04 23:59:59    0.598115
Freq: S, Length: 259200

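So it would of course exclude all data on Sat 2013-01-05 and Sun 2013-01-06, since those days are the weekend. And so on.

I tried a few built-in pandas commands but could not find the right one, because they aggregate by day without taking into account that each day contains many sub-observations. That is, there is a value for every second, and these values should not be averaged but simply collected into a new series.

For example, I tried:

ts.asfreq(BDay()) -> finds the business days, but averages over each day
ts.resample() -> you have to define "how" (mean, max, min, ...)
ts.groupby(lambda x: x.weekday) -> not this either!
ts = pd.Series(df, index=pd.bdate_range(start='2013/01/01 00:00:00', end='2013/01/31 23:59:59', freq='S')) -> df, because the original data is in a DataFrame. Using pd.bdate_range does not help, because df and the index must have the same dimension.

I searched the pandas documentation and googled, but could not find any clue... Does anyone have an idea?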

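I would really appreciate your help.

Thanks

P.S. I would rather not use loops, since my dataset is very large...

I also have several other months to analyze.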

Unfortunately this is a bit slow, but it should at least give the answer you are looking for.

import pandas as pd

#create an index of just the date portion of your index (this is the slow step)
ts_days = pd.to_datetime(ts.index.date)

#create a range of business days over that period
bdays = pd.bdate_range(start=ts.index[0].date(), end=ts.index[-1].date())

#Filter the series to just those days contained in the business day range.
ts = ts[ts_days.isin(bdays)]
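
The slow step above comes from building one Python `date` object per observation. A possible variant, sketched here under the assumption that the index is a `DatetimeIndex`, uses `DatetimeIndex.normalize()` to set every timestamp to midnight without leaving pandas. The sample series and the name `ts_bdays` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one week of per-second data with random values.
index = pd.date_range("2013-01-01", "2013-01-07 23:59:59", freq="s")
ts = pd.Series(np.random.default_rng(0).random(len(index)), index=index)

# normalize() sets every timestamp to midnight while staying a DatetimeIndex.
ts_days = ts.index.normalize()

# Business days (Monday to Friday) spanning the same period.
bdays = pd.bdate_range(start=ts_days[0], end=ts_days[-1])

# Keep only the observations whose day is a business day.
ts_bdays = ts[ts_days.isin(bdays)]
```

Note that `bdate_range` with its default frequency only skips Saturdays and Sundays, so a public holiday such as 2013-01-01 is still kept.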

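Modern pandas stores timestamps as NumPy datetime64 values, typically with nanosecond resolution, which you can verify by inspecting ts.index.values. Converting both the original index and the index produced by bdate_range to daily units ([D]) and then checking which entries of the first array are contained in the second is much faster: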

import numpy as np
import pandas

def _get_days_array(index):
    "Convert the index to a datetime64[D] array"
    return index.values.astype('<M8[D]')

def retain_business_days(ts):
    "Retain only the business days"
    tsdays = _get_days_array(ts.index) 
    bdays = _get_days_array(pandas.bdate_range(tsdays[0], tsdays[-1]))
    # np.isin replaces np.in1d, which is deprecated in recent NumPy releases
    mask = np.isin(tsdays, bdays)
    return ts[mask]
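
Since `bdate_range` with its default frequency simply means Monday to Friday, the same filter can probably also be written as a mask on the day of the week. The sketch below uses a made-up one-week sample and cross-checks the weekday mask against membership in a `bdate_range`; `weekday_mask` and `ts_weekdays` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one week of per-second data with random values.
index = pd.date_range("2013-01-01", "2013-01-07 23:59:59", freq="s")
ts = pd.Series(np.random.default_rng(0).random(len(index)), index=index)

# dayofweek is 0 for Monday ... 6 for Sunday, so < 5 keeps Monday to Friday.
weekday_mask = np.asarray(ts.index.dayofweek < 5)
ts_weekdays = ts[weekday_mask]

# Cross-check against membership in a bdate_range, using daily units.
days = ts.index.values.astype("datetime64[D]")
bdays = pd.bdate_range(ts.index[0].normalize(), ts.index[-1].normalize())
bdays = bdays.values.astype("datetime64[D]")
assert np.array_equal(weekday_mask, np.isin(days, bdays))
```

Neither form knows about public holidays; if those matter, a list of holiday dates would have to be supplied separately.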

This is exactly what I wanted. Works perfectly! Thank you very much.