Python Pandas - getting attributes associated with consecutive datetimes
I have a DataFrame with a list of datetimes by minute (usually in hourly increments), e.g. 2018-01-14 03:00, 2018-01-14 04:00, and so on. What I would like to do is capture the number of consecutive records at a minute increment I define (some may be 60, some may be 15, etc.). I then want to associate the first and last reading time of each block. Take the following data as an example:
id reading_time type
1 1/6/2018 00:00 Interval
1 1/6/2018 01:00 Interval
1 1/6/2018 02:00 Interval
1 1/6/2018 03:00 Interval
1 1/6/2018 06:00 Interval
1 1/6/2018 07:00 Interval
1 1/6/2018 09:00 Interval
1 1/6/2018 10:00 Interval
1 1/6/2018 14:00 Interval
1 1/6/2018 15:00 Interval
I would like the output to look like this:
id first_reading_time last_reading_time number_of_records type
1 1/6/2018 00:00 1/6/2018 03:00 4 Received
1 1/6/2018 04:00 1/6/2018 05:00 2 Missed
1 1/6/2018 06:00 1/6/2018 07:00 2 Received
1 1/6/2018 08:00 1/6/2018 08:00 1 Missed
1 1/6/2018 09:00 1/6/2018 10:00 2 Received
1 1/6/2018 11:00 1/6/2018 13:00 3 Missed
1 1/6/2018 14:00 1/6/2018 15:00 2 Received
Now, there is only one day in this example, and I could write code for a single day, but many rows span multiple days.
Currently, with the code below, I am only able to capture the aggregation up to the first break in consecutive records, not the subsequent sets:
df = pd.DataFrame(data=d)  # 'd' holds the raw data shown above
df.reading_time = pd.to_datetime(df.reading_time)
df = df.sort_values('reading_time', ascending=True)
delta = pd.Timedelta(60, 'm')  # the expected gap between readings
consecutive = df.reading_time.diff().fillna(0).abs().le(delta)
df['consecutive'] = consecutive
# position of the first non-consecutive record
idx_loc = df.index.get_loc(consecutive.idxmin())
first_reading_time = df['reading_time'][0]
last_reading_time = df['reading_time'][idx_loc - 1]
df.iloc[:idx_loc]
Here `d` holds the more granular data shown at the top. The line that builds the variable 'consecutive' marks each record True or False based on the difference in minutes between the current row and the previous one. The variable 'idx_loc' captures the number of consecutive rows, but it only captures the first set (in this example, 1/6/2018 00:00 through 1/6/2018 03:00).
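A minimal illustration of why only the first set is captured: idxmin on a boolean Series returns the label of the first False value, so every gap after the first one is ignored.

```python
import pandas as pd

# A boolean mask with two gaps (False at positions 2 and 4)
consecutive = pd.Series([True, True, False, True, False])

# idxmin returns the label of the *first* minimal value (False),
# so only the first gap is ever found
print(consecutive.idxmin())  # 2
```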
Any help is appreciated.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which yields
first last count type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 Received
4 2018-01-06 08:00:00 2018-01-06 08:00:00 1 Missed
5 2018-01-06 09:00:00 2018-01-06 10:00:00 2 Received
6 2018-01-06 11:00:00 2018-01-06 13:00:00 3 Missed
7 2018-01-06 14:00:00 2018-01-06 15:00:00 2 Received
You can use asfreq to expand the DataFrame to include the missing rows:
df = df.set_index('reading_time')
df = df.asfreq('1H')
df = df.reset_index()
# reading_time id type
# 0 2018-01-06 00:00:00 1.0 Interval
# 1 2018-01-06 01:00:00 1.0 Interval
# 2 2018-01-06 02:00:00 1.0 Interval
# 3 2018-01-06 03:00:00 1.0 Interval
# 4 2018-01-06 04:00:00 NaN NaN
# 5 2018-01-06 05:00:00 NaN NaN
# 6 2018-01-06 06:00:00 1.0 Interval
# 7 2018-01-06 07:00:00 1.0 Interval
# 8 2018-01-06 08:00:00 NaN NaN
# 9 2018-01-06 09:00:00 1.0 Interval
# 10 2018-01-06 10:00:00 1.0 Interval
# 11 2018-01-06 11:00:00 NaN NaN
# 12 2018-01-06 12:00:00 NaN NaN
# 13 2018-01-06 13:00:00 NaN NaN
# 14 2018-01-06 14:00:00 1.0 Interval
# 15 2018-01-06 15:00:00 1.0 Interval
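As a minimal sketch of what asfreq does here: when the DataFrame has a DatetimeIndex, asfreq reindexes it at the given frequency, so any missing timestamps appear as NaN rows.

```python
import pandas as pd

# Two readings with a one-hour gap between them
df = pd.DataFrame({'v': [1, 2]},
                  index=pd.to_datetime(['2018-01-06 00:00',
                                        '2018-01-06 02:00']))

filled = df.asfreq('1H')   # reindex at an hourly frequency
print(len(filled))               # 3 rows: 00:00, 01:00, 02:00
print(filled['v'].isna().sum())  # 1 missing reading at 01:00
```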
Next, use the NaNs in the id column to identify the groups:
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
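A small standalone example of this grouping trick: diff on the 0/1 null indicator is nonzero exactly where the null/non-null state changes, so the cumulative sum assigns one label per run.

```python
import pandas as pd

s = pd.Series([1.0, 1.0, None, None, 1.0])

# 0/1 indicator of missing values: [0, 0, 1, 1, 0]
# diff != 0 marks every change of state (the leading NaN from diff
# also compares != 0), and cumsum numbers the runs
group = (pd.isnull(s).astype(int).diff() != 0).cumsum()
print(group.tolist())  # [1, 1, 2, 2, 3]
```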
Then group by the group values and find the first and last reading times, and the count, for each group:
result = df.groupby('group')['reading_time'].agg(['first','last','count'])
# first last count
# group
# 1 2018-01-06 00:00:00 2018-01-06 03:00:00 4
# 2 2018-01-06 04:00:00 2018-01-06 05:00:00 2
# 3 2018-01-06 06:00:00 2018-01-06 07:00:00 2
# 4 2018-01-06 08:00:00 2018-01-06 08:00:00 1
# 5 2018-01-06 09:00:00 2018-01-06 10:00:00 2
# 6 2018-01-06 11:00:00 2018-01-06 13:00:00 3
# 7 2018-01-06 14:00:00 2018-01-06 15:00:00 2
Since the Missed and Received values alternate, they can be generated from the index:
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
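A quick check of that alternation: indexing the two-element Categorical with the group numbers mod 2 maps odd group labels to 'Received' and even ones to 'Missed'.

```python
import pandas as pd

types = pd.Categorical(['Missed', 'Received'])
idx = pd.Index([1, 2, 3, 4])   # group labels start at 1

# odd labels -> index 1 ('Received'), even labels -> index 0 ('Missed')
labels = types[idx % 2]
print(list(labels))  # ['Received', 'Missed', 'Received', 'Missed']
```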
To handle multiple frequencies on a per-id basis, you could use:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'reading_time': ['1/6/2018 00:00', '1/6/2018 01:00', '1/6/2018 02:00', '1/6/2018 03:00', '1/6/2018 06:00', '1/6/2018 07:00', '1/6/2018 09:00', '1/6/2018 10:00', '1/6/2018 14:00', '1/6/2018 15:00'], 'type': ['Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval', 'Interval']} )
df['reading_time'] = pd.to_datetime(df['reading_time'])
df = df.sort_values(by='reading_time')
df = df.set_index('reading_time')
freqmap = {1:'1H', 2:'15T'}
df = df.groupby('id', group_keys=False).apply(
    lambda grp: grp.asfreq(freqmap[grp['id'].iloc[0]]))
df = df.reset_index(level='reading_time')
df['group'] = (pd.isnull(df['id']).astype(int).diff() != 0).cumsum()
grouped = df.groupby('group')
result = grouped['reading_time'].agg(['first','last','count'])
result['id'] = grouped['id'].agg('first')
types = pd.Categorical(['Missed', 'Received'])
result['type'] = types[result.index % 2]
which produces
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1.0 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 NaN Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1.0 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 NaN Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2.0 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 NaN Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2.0 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 NaN Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2.0 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 NaN Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2.0 Received
It seems that the 'Missed' rows should not be associated with any id, but to bring the result closer to what you posted, you can forward-fill the NaN id values with ffill:
result['id'] = result['id'].ffill()
which changes the result to
first last count id type
group
1 2018-01-06 00:00:00 2018-01-06 03:00:00 4 1 Received
2 2018-01-06 04:00:00 2018-01-06 05:00:00 2 1 Missed
3 2018-01-06 06:00:00 2018-01-06 07:00:00 2 1 Received
4 2018-01-06 07:15:00 2018-01-06 08:45:00 7 1 Missed
5 2018-01-06 09:00:00 2018-01-06 09:00:00 1 2 Received
6 2018-01-06 09:15:00 2018-01-06 09:45:00 3 2 Missed
7 2018-01-06 10:00:00 2018-01-06 10:00:00 1 2 Received
8 2018-01-06 10:15:00 2018-01-06 13:45:00 15 2 Missed
9 2018-01-06 14:00:00 2018-01-06 14:00:00 1 2 Received
10 2018-01-06 14:15:00 2018-01-06 14:45:00 3 2 Missed
11 2018-01-06 15:00:00 2018-01-06 15:00:00 1 2 Received
Hope the link helps.

Do the id values play a role in how the rows are grouped? How was it determined that the rows marked 'Missed' have an id value of 1? (If those rows are missing, do they belong to any id, or to no id?)

Wow... this is fantastic! Is it possible to make the .asfreq('1H') call take a variable number of minutes? For example, one id's readings might come at 60-minute intervals, while another's come at 15-minute intervals, and yet another's at 5-minute intervals.

I modified the post above to show how to handle multiple frequencies on a per-id basis.

One last question: what if I wanted to take this out to the first or last hour of the day? For example, suppose we missed the first two hours, 2018-01-06 00:00 and 2018-01-06 01:00, and/or the last two hours, 2018-01-06 22:00 and 2018-01-06 23:00. How could we capture those missed readings?

I think the simplest way to handle that is to add a new starting and/or ending row to df, proceed as above, and then change the last line to result['type'] = types[(result.index + 1) % 2] so that the alternation starts with 'Missed'.
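One way to sketch the boundary idea from that last comment (a minimal sketch, assuming hourly readings on a single known day): reindex the frame over the full day before grouping, so that leading and trailing gaps also become NaN runs.

```python
import pandas as pd

# Readings that miss the first two hours of the day
df = pd.DataFrame({'id': [1, 1],
                   'reading_time': pd.to_datetime(['2018-01-06 02:00',
                                                   '2018-01-06 03:00'])})
df = df.set_index('reading_time')

# Cover the whole day; hours outside the observed range become NaN rows,
# which the grouping step above will then label as 'Missed'
full_day = pd.date_range('2018-01-06 00:00', '2018-01-06 23:00', freq='1H')
df = df.reindex(full_day)

print(df['id'].isna().sum())   # 22 of the 24 hours are missing
print(df['id'].notna().sum())  # 2 observed readings
```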