Manipulating time series data in Python: summing and aggregating series over a time period
I'm trying to figure out how to visualize some sensor data. I'm collecting data from multiple devices every 5 minutes and storing it in a JSON structure that looks something like this (note that I have no control over the data format):
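A minimal sketch of that shape, inferred from the parsing code later in this post; the field names come from that code, and the values are the ones that reappear in the flattened output further down:

[
  {
    "group": {"project_id": "01234"},
    "measures": {
      "measures": {
        "...device 1 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 1],
              ["2019-04-17T14:35:00+00:00", 300, 2]
            ]
          }
        },
        "...device 2 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 0],
              ["2019-04-17T14:35:00+00:00", 300, 1]
            ]
          }
        }
      }
    }
  }
]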
Each tuple of the form ["2019-04-17T14:30:00+00:00", 300, 0] is [timestamp, granularity, value]. Devices are grouped by project id. Within any given group, I want to take the data for the several devices and sum them together. For example, for the sample data above, I want the resulting series to look like this:
["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 3],
The series are not necessarily the same length. Finally, I want to aggregate those measurements into hourly samples.

I can get the individual series like this:
import pandas as pd

with open('data.json') as fd:
    data = pd.read_json(fd)

for i, group in enumerate(data.group):
    project = group['project_id']
    instances = data.measures[i]['measures']
    series_for_group = []
    for instance in instances.keys():
        # `metric` and `aggregate` are selected elsewhere (they end up
        # as command line options in the final script below).
        measures = instances[instance][metric][aggregate]

        # build an index from the timestamps
        index = pd.DatetimeIndex(measure[0] for measure in measures)

        # extract values from the data and link them to the index
        series = pd.Series((measure[2] for measure in measures),
                           index=index)

        series_for_group.append(series)
At the bottom of the outer for loop, I have a list of pandas.core.series.Series objects representing the different sets of measurements associated with the current group. I was hoping I could simply add them together, as in total = sum(series_for_group), but that produces invalid data. I'm also interested in the .groupby and .agg methods, but it's not clear from the examples how I would specify the interval size. Maybe concat and groupby? For example:
final = pd.concat(all_series).groupby(level=0).sum()
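That guess does in fact behave as hoped. Here is a self-contained sketch, with made-up series standing in for series_for_group, showing why plain addition produces the invalid data while the concat/groupby combination does not:

import pandas as pd

# Two series of different lengths, sharing some timestamps.
s1 = pd.Series([1, 2], index=pd.DatetimeIndex(
    ['2019-04-17T14:30:00+00:00', '2019-04-17T14:35:00+00:00']))
s2 = pd.Series([0, 1, 5], index=pd.DatetimeIndex(
    ['2019-04-17T14:30:00+00:00', '2019-04-17T14:35:00+00:00',
     '2019-04-17T14:40:00+00:00']))

# Plain addition aligns on the index and yields NaN wherever a
# timestamp is missing from either series; sum(series_for_group)
# does the same repeated `+`, hence the invalid data.
print(s1 + s2)        # the 14:40 sample becomes NaN

# Concatenating and then grouping on the index (level=0) instead
# sums whatever values actually exist at each timestamp.
print(pd.concat([s1, s2]).groupby(level=0).sum())   # 1, 3, 5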
To build a dataframe (df) from series of different lengths (for example s1, s2, s3), you can try:

df = pd.concat([s1, s2, s3], ignore_index=True, axis=1).fillna('')

After building the dataframe, add an hour column:

df['Hour'] = df['Date'].dt.hour

then group by hour and sum the values:

df.groupby('Hour').sum()
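A runnable sketch of that idea, with two assumptions filled in: the timestamps are kept (ignore_index=True would discard them, so this aligns on the index instead) and moved into an explicit Date column, and gaps are filled with 0 rather than '' so the sums stay numeric:

import pandas as pd

s1 = pd.Series([1, 2], index=pd.to_datetime(
    ['2019-04-17T14:30:00', '2019-04-17T15:35:00']))
s2 = pd.Series([0, 1, 4], index=pd.to_datetime(
    ['2019-04-17T14:30:00', '2019-04-17T15:35:00',
     '2019-04-17T16:40:00']))

# Line the series up side by side; the shorter one gets NaN rows.
df = pd.concat([s1, s2], axis=1).fillna(0)

# Keep the timestamps as a regular column named 'Date'.
df = df.reset_index().rename(columns={'index': 'Date'})

# Extract the hour, then aggregate the value columns on it.
df['Hour'] = df['Date'].dt.hour
print(df.drop(columns='Date').groupby('Hour').sum())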
This is what I suggested in the comments:
result = pd.DataFrame({}, columns=['timestamp', 'granularity', 'value',
                                   'project', 'uuid', 'metric', 'agg'])

for i, group in enumerate(data.group):
    project = group['project_id']
    instances = data.measures[i]['measures']
    for device, measures in instances.items():
        for metric, aggs in measures.items():
            for agg, lst in aggs.items():
                sub_df = pd.DataFrame(lst, columns=['timestamp', 'granularity', 'value'])
                sub_df['project'] = project
                sub_df['uuid'] = device
                sub_df['metric'] = metric
                sub_df['agg'] = agg

                result = pd.concat((result, sub_df), sort=True)

# parse dates:
result['timestamp'] = pd.to_datetime(result['timestamp'])
This produces data like:

    agg  granularity            metric project           timestamp                 uuid  value
0  mean          300  metric.name.here   01234 2019-04-17 14:30:00  ...device 1 uuid...      1
1  mean          300  metric.name.here   01234 2019-04-17 14:35:00  ...device 1 uuid...      2
0  mean          300  metric.name.here   01234 2019-04-17 14:30:00  ...device 2 uuid...      0
1  mean          300  metric.name.here   01234 2019-04-17 14:35:00  ...device 2 uuid...      1
You can then do an overall aggregation:

result.resample('H', on='timestamp').value.sum()

which gives:

timestamp
2019-04-17 14:00:00    4
Freq: H, Name: value, dtype: int64
or a per-device aggregation:

result.groupby('uuid').resample('H', on='timestamp').value.sum()

which gives:

uuid                 timestamp
...device 1 uuid...  2019-04-17 14:00:00    3
...device 2 uuid...  2019-04-17 14:00:00    1
Name: value, dtype: int64
In the end I came up with a working solution based on the code in my question. On my system it takes about 6 seconds to process roughly 85 MB of input data; by comparison, I cancelled Quang's code after 5 minutes. I don't know whether this is the right way to process this data, but it produces apparently correct results. I noticed that in this solution, building up a list of series and making a single pd.concat call is far more efficient than putting pd.concat inside the loop.
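A small illustration of that last point, with made-up data: each pd.concat copies every row accumulated so far, so concatenating inside the loop does quadratic work, while collecting the pieces and concatenating once does not. The full script follows the sketch.

import time
import pandas as pd

pieces = [pd.DataFrame({'value': range(100)}) for _ in range(500)]

# Quadratic: each iteration copies everything accumulated so far.
t0 = time.monotonic()
result = pd.DataFrame()
for piece in pieces:
    result = pd.concat((result, piece))
print('concat in loop:', time.monotonic() - t0)

# Much faster: collect the pieces, then concatenate them once.
t0 = time.monotonic()
result = pd.concat(pieces)
print('concat once:', time.monotonic() - t0)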
#!/usr/bin/python3

import click
import matplotlib.pyplot as plt
import pandas as pd


@click.command()
@click.option('-a', '--aggregate', default='mean')
@click.option('-p', '--projects')
@click.option('-r', '--resample')
@click.option('-o', '--output')
@click.argument('metric')
@click.argument('datafile', type=click.File(mode='rb'))
def plot_metric(aggregate, projects, output, resample, metric, datafile):
    # Read in a list of project id -> project name mappings, then
    # convert it to a dictionary.
    if projects:
        _projects = pd.read_json(projects)
        projects = {_projects.ID[n]: _projects.Name[n].lstrip('_')
                    for n in range(len(_projects))}
    else:
        projects = {}

    data = pd.read_json(datafile)
    df = pd.DataFrame()

    for i, group in enumerate(data.group):
        project = group['project_id']
        project = projects.get(project, project)

        devices = data.measures[i]['measures']
        all_series = []
        for device, measures in devices.items():
            samples = measures[metric][aggregate]
            index = pd.DatetimeIndex(sample[0] for sample in samples)
            series = pd.Series((sample[2] for sample in samples),
                               index=index)
            all_series.append(series)

        # concatenate all the measurements for this project, then
        # group them using the timestamp and sum the values.
        final = pd.concat(all_series).groupby(level=0).sum()

        # resample the data if requested
        if resample:
            final = final.resample(resample).sum()

        # add series to dataframe
        df[project] = final

    fig, ax = plt.subplots()
    df.plot(ax=ax, figsize=(11, 8.5))
    ax.legend(frameon=False, loc='upper right', ncol=3)

    if output:
        plt.savefig(output)
        plt.close()
    else:
        plt.show()


if __name__ == '__main__':
    plot_metric()
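Invocation looks something like this (the metric name matches the sample data above; the script and file names are placeholders):

./plot_metric.py -a mean -r H -o out.png metric.name.here data.json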
I'm not sure looping is the way to go. You may want to flatten your data out fully and aggregate it into one big dataframe before processing it.

I'm not sure what you mean by "flatten my data out".

I mean something very similar to your code, but bringing all the information together and appending it to one big dataframe. Each column of that dataframe would have a single dtype rather than dict.

Does that dataframe have any advantage over using .resample('H').sum()? Also, you said "I'm not sure looping is the way to go", but this solution uses deeply nested loops. I tried running it, and after five minutes it was still running; something is wrong here, because my code completes in about 6 seconds. I'll post it later tonight and maybe we can figure out the difference.

@larsks Yes, I only put that code there to show what the final dataframe looks like. I never expected the nested for loops to perform well anyway.