Python: how do I find discontinuities in a datetime index, and add the per-stretch mean on top of each contiguous run?
I have time-series data, but it contains a discontinuity: the row for 2005-03-02 02:08:00 is missing.
I need a new column C such that C(i) = A(i) + B(i) + average, where the average is the mean of B up to the discontinuity (02:08:00):
average = Data.loc['2005-03-02 02:05:30':'2005-03-02 02:07:30',['B']].mean(axis=0)
After the discontinuity we have to recalculate the average again, up to the next discontinuity:
average = Data.loc['2005-03-02 02:08:30':'2005-03-02 02:11:00',['B']].mean(axis=0)
Input:
Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14
Output:
Date,A,B,C
2005-03-02 02:05:30,1,3,9
2005-03-02 02:06:00,2,4,11
2005-03-02 02:06:30,3,5,13
2005-03-02 02:07:00,4,6,15
2005-03-02 02:07:30,5,7,17
2005-03-02 02:08:30,7,9,28
2005-03-02 02:09:00,7,9,28
2005-03-02 02:09:30,7,9,28
2005-03-02 02:10:00,8,12,32
2005-03-02 02:10:30,9,13,34
2005-03-02 02:11:00,10,14,36
How do I find the discontinuities in the index, and how do I do all of this with pandas?
If a point is described as p(v, t), with A = (3, 1) and B = (10, 5), then any point C satisfies C(v) = A(v) + (B(v) - A(v)) * ((C(t) - A(t)) / (B(t) - A(t))).
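A minimal sketch of that interpolation formula (the helper function name is my own, for illustration):

```python
# Hypothetical helper illustrating the interpolation formula above.
# A point is a (value, time) pair: A = (3, 1), B = (10, 5).
def interpolate(a, b, t):
    """Linearly interpolate the value at time t between points a and b."""
    av, at = a
    bv, bt = b
    return av + (bv - av) * ((t - at) / (bt - at))

# Halfway in time between A and B, the value is halfway between 3 and 10.
c = interpolate((3, 1), (10, 5), 3)  # 6.5
```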
Step 1: Read the dataframe
import pandas as pd
from io import StringIO
y = '''Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14'''
df = pd.read_csv(StringIO(y), index_col='Date')
Step 2: Convert to a datetime index
df.index = pd.to_datetime(df.index)
Step 3: Resample at a 30-second frequency
new = df.resample('30s').mean()
Output:
A B
Date
2005-03-02 02:05:30 1.0 3.0
2005-03-02 02:06:00 2.0 4.0
2005-03-02 02:06:30 3.0 5.0
2005-03-02 02:07:00 4.0 6.0
2005-03-02 02:07:30 5.0 7.0
2005-03-02 02:08:00 NaN NaN
2005-03-02 02:08:30 7.0 9.0
2005-03-02 02:09:00 7.0 9.0
2005-03-02 02:09:30 7.0 9.0
2005-03-02 02:10:00 8.0 12.0
2005-03-02 02:10:30 9.0 13.0
2005-03-02 02:11:00 10.0 14.0
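As a side note, once the series is resampled onto a regular grid, the missing timestamps themselves can be listed by selecting the all-NaN rows. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame on a regular 30 s grid with one all-NaN (gap) row.
toy = pd.DataFrame(
    {'A': [1.0, 2.0, np.nan, 4.0], 'B': [3.0, 4.0, np.nan, 6.0]},
    index=pd.date_range('2005-03-02 02:05:30', periods=4, freq='30s'))

# Rows where every column is NaN mark the discontinuities.
missing = toy.index[toy.isnull().all(axis=1)]
```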
Step 4: Split the dataframe at the NaN rows and assign group IDs
new["group_no"] = new.T.isnull().all().cumsum()
Output:
A B group_no
Date
2005-03-02 02:05:30 1.0 3.0 0
2005-03-02 02:06:00 2.0 4.0 0
2005-03-02 02:06:30 3.0 5.0 0
2005-03-02 02:07:00 4.0 6.0 0
2005-03-02 02:07:30 5.0 7.0 0
2005-03-02 02:08:00 NaN NaN 1
2005-03-02 02:08:30 7.0 9.0 1
2005-03-02 02:09:00 7.0 9.0 1
2005-03-02 02:09:30 7.0 9.0 1
2005-03-02 02:10:00 8.0 12.0 1
2005-03-02 02:10:30 9.0 13.0 1
2005-03-02 02:11:00 10.0 14.0 1
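The group-numbering trick relies on `cumsum` over a boolean mask: every all-NaN row bumps the counter, so each contiguous stretch gets its own label. A minimal sketch:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'B': [3.0, 5.0, np.nan, 9.0, 12.0]})
# True only on the all-NaN row; cumsum turns the mask into group labels.
toy['group_no'] = toy.isnull().all(axis=1).cumsum()
```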
Step 5: Compute the mean of B for each group_no
new['Bmean'] = new.groupby('group_no').transform('mean').B
Output:
A B group_no Bmean
Date
2005-03-02 02:05:30 1.0 3.0 0 5.0
2005-03-02 02:06:00 2.0 4.0 0 5.0
2005-03-02 02:06:30 3.0 5.0 0 5.0
2005-03-02 02:07:00 4.0 6.0 0 5.0
2005-03-02 02:07:30 5.0 7.0 0 5.0
2005-03-02 02:08:00 NaN NaN 1 11.0
2005-03-02 02:08:30 7.0 9.0 1 11.0
2005-03-02 02:09:00 7.0 9.0 1 11.0
2005-03-02 02:09:30 7.0 9.0 1 11.0
2005-03-02 02:10:00 8.0 12.0 1 11.0
2005-03-02 02:10:30 9.0 13.0 1 11.0
2005-03-02 02:11:00 10.0 14.0 1 11.0
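`transform('mean')` broadcasts each group's mean back onto every row of that group (unlike `agg`, which returns one row per group). A small sketch:

```python
import pandas as pd

toy = pd.DataFrame({'g': [0, 0, 1, 1, 1],
                    'B': [3.0, 5.0, 9.0, 12.0, 15.0]})
# Each row receives the mean of B over its own group.
toy['Bmean'] = toy.groupby('g')['B'].transform('mean')
```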
Step 6: Apply the required transformation and drop the helper columns
new['C'] = new['A'] + new['B'] + new['Bmean']
new.drop(['group_no', 'Bmean'], axis=1, inplace=True)
Output:
A B C
Date
2005-03-02 02:05:30 1.0 3.0 9.0
2005-03-02 02:06:00 2.0 4.0 11.0
2005-03-02 02:06:30 3.0 5.0 13.0
2005-03-02 02:07:00 4.0 6.0 15.0
2005-03-02 02:07:30 5.0 7.0 17.0
2005-03-02 02:08:00 NaN NaN NaN
2005-03-02 02:08:30 7.0 9.0 27.0
2005-03-02 02:09:00 7.0 9.0 27.0
2005-03-02 02:09:30 7.0 9.0 27.0
2005-03-02 02:10:00 8.0 12.0 31.0
2005-03-02 02:10:30 9.0 13.0 33.0
2005-03-02 02:11:00 10.0 14.0 35.0
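Putting the steps together, the whole pipeline can be run as one self-contained script (a sketch using `groupby('group_no')['B']` instead of the `.B` attribute access; same result):

```python
import pandas as pd
from io import StringIO

y = '''Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14'''

df = pd.read_csv(StringIO(y), index_col='Date')
df.index = pd.to_datetime(df.index)

new = df.resample('30s').mean()                      # exposes the gap as a NaN row
new['group_no'] = new.isnull().all(axis=1).cumsum()  # label contiguous stretches
new['Bmean'] = new.groupby('group_no')['B'].transform('mean')  # NaNs are skipped
new['C'] = new['A'] + new['B'] + new['Bmean']
new = new.drop(['group_no', 'Bmean'], axis=1)
```

The NaN row inserted by `resample` stays in the result, so its C is also NaN.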
I would suggest:
#if unique values in index use reindex
df = Data.reindex(pd.date_range(Data.index.min(), Data.index.max(), freq='30S'))
#if non unique values in index
#df = df.resample('30s').mean()
#get mask for NaNs rows
mask = df.isnull().all(axis=1)
#get sum of all columns
s1 = df.sum(axis=1)
#if need sum only A, B columns
#s1 = df[['A', 'B']].sum(axis=1)
#create column for grouping
df['C'] = mask.cumsum()
#filter out NaNs rows
df = df[~mask]
#transform mean and add sum
df['C'] = df.groupby('C')['B'].transform('mean') + s1
print (df)
A B C
2005-03-02 02:05:30 1.0 3.0 9.0
2005-03-02 02:06:00 2.0 4.0 11.0
2005-03-02 02:06:30 3.0 5.0 13.0
2005-03-02 02:07:00 4.0 6.0 15.0
2005-03-02 02:07:30 5.0 7.0 17.0
2005-03-02 02:08:30 7.0 9.0 27.0
2005-03-02 02:09:00 7.0 9.0 27.0
2005-03-02 02:09:30 7.0 9.0 27.0
2005-03-02 02:10:00 8.0 12.0 31.0
2005-03-02 02:10:30 9.0 13.0 33.0
2005-03-02 02:11:00 10.0 14.0 35.0
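For reference, a self-contained version of the reindex approach, with the `Data` frame built from the question's sample input:

```python
import pandas as pd
from io import StringIO

csv = '''Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14'''

Data = pd.read_csv(StringIO(csv), parse_dates=['Date'], index_col='Date')

# Reindex onto the full 30 s grid; missing timestamps become all-NaN rows.
df = Data.reindex(pd.date_range(Data.index.min(), Data.index.max(), freq='30s'))
mask = df.isnull().all(axis=1)    # True on the gap rows
s1 = df.sum(axis=1)               # per-row A + B
df['C'] = mask.cumsum()           # group label per contiguous stretch
df = df[~mask].copy()             # drop the gap rows
df['C'] = df.groupby('C')['B'].transform('mean') + s1
```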
Another solution, with thanks to @iDrwish for the suggestion: first take the diff() of the index (not yet implemented on an Index, so convert the index to a Series first), compare it against a 30 s Timedelta, create groups with cumsum, and finally use transform with mean and add the sum of the columns:
g = Data.index.to_series().diff().gt(pd.Timedelta(30, unit='s')).cumsum()
Data['C'] = Data.groupby(g)['B'].transform('mean') + Data.sum(axis=1)
#if need specify columns
#Data['C'] = Data.groupby(g)['B'].transform('mean') + Data['A'] + Data['B']
print (Data)
A B C
Date
2005-03-02 02:05:30 1 3 9
2005-03-02 02:06:00 2 4 11
2005-03-02 02:06:30 3 5 13
2005-03-02 02:07:00 4 6 15
2005-03-02 02:07:30 5 7 17
2005-03-02 02:08:30 7 9 27
2005-03-02 02:09:00 7 9 27
2005-03-02 02:09:30 7 9 27
2005-03-02 02:10:00 8 12 31
2005-03-02 02:10:30 9 13 33
2005-03-02 02:11:00 10 14 35
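The gap detection needs no reindexing at all here; a self-contained sketch of the diff-based approach on the sample data:

```python
import pandas as pd
from io import StringIO

csv = '''Date,A,B
2005-03-02 02:05:30,1,3
2005-03-02 02:06:00,2,4
2005-03-02 02:06:30,3,5
2005-03-02 02:07:00,4,6
2005-03-02 02:07:30,5,7
2005-03-02 02:08:30,7,9
2005-03-02 02:09:00,7,9
2005-03-02 02:09:30,7,9
2005-03-02 02:10:00,8,12
2005-03-02 02:10:30,9,13
2005-03-02 02:11:00,10,14'''

Data = pd.read_csv(StringIO(csv), parse_dates=['Date'], index_col='Date')

# A step larger than 30 s marks a discontinuity; cumsum numbers the stretches.
g = Data.index.to_series().diff().gt(pd.Timedelta(30, unit='s')).cumsum()
Data['C'] = Data.groupby(g)['B'].transform('mean') + Data['A'] + Data['B']
```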