Manipulating time series data in Python: summing and aggregating series over a time period


I'm trying to figure out how to visualize some sensor data. I'm collecting the data every 5 minutes for multiple devices, stored in a JSON structure that looks something like this (note that I don't have control over the data structure):
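
A sketch of the layout, inferred from the access patterns in the code further down (data.group[i]['project_id'] and data.measures[i]['measures'][device][metric][aggregate]) rather than taken from the original sample; the metric name and device uuids are placeholders:

[
  {
    "group": {"project_id": "01234"},
    "measures": {
      "measures": {
        "...device 1 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 1],
              ["2019-04-17T14:35:00+00:00", 300, 2]
            ]
          }
        },
        "...device 2 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 0],
              ["2019-04-17T14:35:00+00:00", 300, 1]
            ]
          }
        }
      }
    }
  }
]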

Each tuple of the form ["2019-04-17T14:30:00+00:00", 300, 0] is [timestamp, granularity, value]. Devices are grouped by a project id. Within any given group, I want to take the data for multiple devices and add them together. For example, given the sample data above, I want the final series to look like this:

["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 3],
The series are not necessarily all the same length.

Finally, I want to aggregate these measurements into hourly samples.

I can get an individual series like this:

import pandas as pd

with open('data.json') as fd:
    data = pd.read_json(fd)

for i, group in enumerate(data.group):
    project = group['project_id']
    instances = data.measures[i]['measures']
    series_for_group = []
    for instance in instances.keys():
        measures = instances[instance][metric][aggregate]

        # build an index from the timestamps
        index = pd.DatetimeIndex(measure[0] for measure in measures)

        # extract values from the data and link it to the index
        series = pd.Series((measure[2] for measure in measures),
                           index=index)

        series_for_group.append(series)

At the bottom of the outer for loop, I have a list of pandas.core.series.Series objects representing the different sets of measurements associated with the current group. I hoped I could simply add them together, as in total = sum(series_for_group), but that produces invalid data.
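
A likely culprit (an illustrative sketch, not from the original post): adding Series with + aligns them on their indexes, so any timestamp present in only one series comes out as NaN.

import pandas as pd

# two series with partially overlapping timestamps
s1 = pd.Series([1, 2], index=pd.DatetimeIndex(['2019-04-17 14:30', '2019-04-17 14:35']))
s2 = pd.Series([0], index=pd.DatetimeIndex(['2019-04-17 14:30']))

# '+' aligns on the index: the timestamp missing from s2 becomes NaN
print(s1 + s2)
# 2019-04-17 14:30:00    1.0
# 2019-04-17 14:35:00    NaN

# add() with fill_value=0 treats missing entries as zero instead
print(s1.add(s2, fill_value=0))
# 2019-04-17 14:30:00    1.0
# 2019-04-17 14:35:00    2.0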

  • Am I reading this data correctly? This is my first time working with pandas, and I'm not sure whether (a) creating an index and then (b) populating the data is the correct procedure here.

  • How do I successfully add these series together?

  • How do I resample this data into 1-hour intervals? From the examples I have looked at, it seems the .groupby and .agg methods are of interest, but it isn't clear from those examples how to specify the interval size.

  • Update 1

    Maybe I can use concat and groupby? For example:

    final = pd.concat(all_series).groupby(level=0).sum()
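
    Continuing the toy series from the sketch above (illustrative, not from the original post), this does collapse duplicate timestamps by summing them: groupby(level=0) groups on the index, i.e. on the timestamps.

    total = pd.concat([s1, s2]).groupby(level=0).sum()
    # 2019-04-17 14:30:00    1
    # 2019-04-17 14:35:00    2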
    
    To build a dataframe (df) from series with different lengths (for example s1, s2, s3), you can try:

    df = pd.concat([s1, s2, s3], ignore_index=True, axis=1).fillna('')
    
    After the dataframe is built:

  • Make sure all dates are stored as timestamp objects:

    df['Date'] = pd.to_datetime(df['Date'])

  • Then, add another column that extracts the hour from the date column:

    df['Hour'] = df['Date'].dt.hour
    
    Then group by hour and sum the values:

    df.groupby('Hour').sum()
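
    A self-contained sketch of this approach (the series values and the Date column are illustrative; fillna(0) is used instead of fillna('') so the columns stay numeric and summable):

    import pandas as pd

    s1 = pd.Series([1, 2, 3])
    s2 = pd.Series([4, 5])
    s3 = pd.Series([6])

    df = pd.concat([s1, s2, s3], ignore_index=True, axis=1).fillna(0)

    # illustrative timestamps; in the real data these come from the tuples
    df['Date'] = pd.to_datetime(['2019-04-17 14:30:00',
                                 '2019-04-17 14:35:00',
                                 '2019-04-17 15:00:00'])
    df['Hour'] = df['Date'].dt.hour

    # sum the numeric columns within each hour; note that grouping on the
    # bare hour number merges the same hour from different days
    print(df.groupby('Hour')[[0, 1, 2]].sum())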
    

    I suggested doing this in the comments:

    import pandas as pd

    with open('data.json') as fd:
        data = pd.read_json(fd)
    
    for i, group in enumerate(data.group):
        project = group['project_id']
        instances = data.measures[i]['measures']
        series_for_group = []
        for instance in instances.keys():
            measures = instances[instance][metric][aggregate]
    
            # build an index from the timestamps
            index = pd.DatetimeIndex(measure[0] for measure in measures)
    
            # extract values from the data and link it to the index
            series = pd.Series((measure[2] for measure in measures),
                               index=index)
    
            series_for_group.append(series)
    
    result = pd.DataFrame({}, columns=['timestamp', 'granularity', 'value',
                                   'project', 'uuid', 'metric', 'agg'])
    for i, group in enumerate(data.group):
        project = group['id']
        instances = data.measures[i]['measures']
    
        series_for_group = []
    
    
        for device, measures in instances.items():
            for metric, aggs in measures.items():
                for agg, lst in aggs.items():
                    sub_df = pd.DataFrame(lst, columns = ['timestamp', 'granularity', 'value'])
                    sub_df['project'] = project
                    sub_df['uuid'] = device
                    sub_df['metric'] = metric
                    sub_df['agg'] = agg
    
                    result = pd.concat((result,sub_df), sort=True)
    
    # parse date:
    result['timestamp'] = pd.to_datetime(result['timestamp'])
    
    This produces data that looks like:

        agg   granularity  metric            project  timestamp            uuid                 value
    0   mean  300          metric.name.here  01234    2019-04-17 14:30:00  ...device 1 uuid...  1
    1   mean  300          metric.name.here  01234    2019-04-17 14:35:00  ...device 1 uuid...  2
    0   mean  300          metric.name.here  01234    2019-04-17 14:30:00  ...device 2 uuid...  0
    1   mean  300          metric.name.here  01234    2019-04-17 14:35:00  ...device 2 uuid...  1
    
    You can then do an overall aggregation:

    result.resample('H', on='timestamp').value.sum()
    
    which gives:

    timestamp
    2019-04-17 14:00:00    4
    Freq: H, Name: value, dtype: int64
    
    or a grouped aggregation:

    result.groupby('uuid').resample('H', on='timestamp').value.sum()
    
    which gives:

    uuid                 timestamp          
    ...device 1 uuid...  2019-04-17 14:00:00    3
    ...device 2 uuid...  2019-04-17 14:00:00    1
    Name: value, dtype: int64
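
    To get one column per device for plotting, the grouped result can be pivoted with unstack (a sketch building on the code above, not from the original answer):

    result.groupby('uuid').resample('H', on='timestamp').value.sum().unstack(level=0)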
    

    In the end I arrived at a working solution based on the code in my question. On my system, it takes about 6 seconds to process roughly 85MB of input data. By comparison, I cancelled Quang's code after 5 minutes.

    I don't know whether this is the "right" way to process this data, but it produces results that appear to be correct. I noticed that in this solution, building a list of series and then making a single pd.concat call is substantially more efficient than calling pd.concat inside the loop.
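
    A minimal sketch of the difference (illustrative sizes, not from the original post): concatenating inside the loop copies every previously accumulated row on each iteration, so it is roughly quadratic in the total number of rows, while a single concat over a list copies each row once.

    import pandas as pd

    parts = [pd.DataFrame({'value': range(100)}) for _ in range(1000)]

    # slow: each iteration copies everything accumulated so far
    result = pd.DataFrame()
    for part in parts:
        result = pd.concat((result, part))

    # fast: collect the pieces first, concatenate once
    result = pd.concat(parts)

    The full script: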

    #!/usr/bin/python3
    
    import click
    import matplotlib.pyplot as plt
    import pandas as pd
    
    
    @click.command()
    @click.option('-a', '--aggregate', default='mean')
    @click.option('-p', '--projects')
    @click.option('-r', '--resample')
    @click.option('-o', '--output')
    @click.argument('metric')
    @click.argument('datafile', type=click.File(mode='rb'))
    def plot_metric(aggregate, projects, output, resample, metric, datafile):
    
        # Read in a list of project id -> project name mappings, then
        # convert it to a dictionary.
        if projects:
            _projects = pd.read_json(projects)
            projects = {_projects.ID[n]: _projects.Name[n].lstrip('_')
                        for n in range(len(_projects))}
        else:
            projects = {}
    
        data = pd.read_json(datafile)
        df = pd.DataFrame()
    
        for i, group in enumerate(data.group):
            project = group['project_id']
            project = projects.get(project, project)
    
            devices = data.measures[i]['measures']
            all_series = []
            for device, measures in devices.items():
                samples = measures[metric][aggregate]
                index = pd.DatetimeIndex(sample[0] for sample in samples)
                series = pd.Series((sample[2] for sample in samples),
                                   index=index)
                all_series.append(series)
    
            # concatenate all the measurements for this project, then
            # group them using the timestamp and sum the values.
            final = pd.concat(all_series).groupby(level=0).sum()
    
            # resample the data if requested
            if resample:
                final = final.resample(resample).sum()
    
            # add series to dataframe
            df[project] = final
    
        fig, ax = plt.subplots()
        df.plot(ax=ax, figsize=(11, 8.5))
        ax.legend(frameon=False, loc='upper right', ncol=3)
    
        if output:
            plt.savefig(output)
            plt.close()
        else:
            plt.show()
    
    
    if __name__ == '__main__':
        plot_metric()
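
    A hypothetical invocation, assuming the script above is saved as plot_metric.py (the metric name and file names are placeholders):

    python3 plot_metric.py --resample 1H --output out.png metric.name.here data.json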
    

    I'm not sure looping is the way to go. You may want to fully inflate your data and aggregate it into one big dataframe before doing any processing.

    I'm not sure what "inflate my data" means.

    I mean something very similar to your code, but pulling all of the information together and appending it into one big dataframe. Each column of this dataframe has a single dtype, rather than dict.

    Does that dataframe have any advantage over just using .resample('H').sum()? Also, you said "I'm not sure looping is the way to go", but this solution uses deeply nested loops. I tried running it, and after five minutes it was still going; something is wrong here, because my code completes in about 6 seconds. I'll post it later tonight and maybe we can figure out the difference.

    @larsks Yes, I only put that code there to show what the final dataframe looks like. I never expected the nested for loops to perform particularly well anyway.