Python: group by hostname and average the sessions per hour (per host)


The dataframe looks like this:

              datetime   hostname  sessions
0  2020-10-27 00:00:05  server001        22
1  2020-10-27 00:00:10  server001        25
2  2020-10-27 00:00:15  server001        21
3  2020-10-27 01:00:05  server001        30
4  2020-10-27 01:00:10  server001        30
5  2020-10-27 01:00:15  server001        35
6  2020-10-27 00:00:05  server002        15
7  2020-10-27 00:00:10  server002        10
8  2020-10-27 00:00:15  server002        11
9  2020-10-27 01:00:05  server002        19
10 2020-10-27 01:00:10  server002        22
11 2020-10-27 01:00:15  server002        18
I'm trying to show the average number of sessions per hour for each individual hostname.

So I would get something like this:

              datetime   hostname  sessions
0  2020-10-27 00:00:00  server001        23
1  2020-10-27 01:00:00  server001        32
2  2020-10-27 00:00:00  server002        12
3  2020-10-27 01:00:00  server002        20
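I think my grouping is wrong, because when I attempt this, the result is usually just the largest hourly average of any given hostname, ordered by date and hour.

For example, I might see: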

                hostname   datetime     sessions
0  2020-10-27  server001   00:00:00           23
1  2020-10-27              01:00:00           32
2  2020-10-27  server002   02:00:00           12
3  2020-10-27  server003   03:00:00           20
instead of the full 24 hours listed for each hostname.

The code I tried is:

df = df.groupby(['hostname']).resample(
        'H', on='datetime'
        ).agg({'sessions': 'mean'}).round(0).astype(int)

What do I need to do to get the desired result?

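Here is an example based on the data you provided. I added a step that converts the dates to datetime (in case they are objects) and set datetime as a DatetimeIndex so that resample can be used. It would go something like this: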

import pandas as pd
import numpy as np
d ={'datetime' :['2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15','2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15'],
   'hostname':['server001','server001','server001','server001','server001','server001','server002','server002','server002','server002','server002','server002'],
   'sessions':[ 22,25,21 ,30,30,35,15,10, 11,19,22,18]}       
df = pd.DataFrame(data=d)
df['datetime'] =  pd.to_datetime(df['datetime'])
df = df.set_index(pd.DatetimeIndex(df['datetime']))
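# Hourly mean of the numeric column only; note that this averages across all hosts.
# Newer pandas versions prefer the lowercase alias 'h' over 'H'.
df.resample('H')['sessions'].mean()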
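You can of course adapt this example to other uses. As I understand your question, you want to compute the average number of sessions per hour. If you need further groupbys, check the resample function.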
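If the goal is a per-host hourly mean, one possible way to combine the grouping with the hourly binning is `pd.Grouper`. The following is only a minimal sketch, assuming the column names from the question; the variable name `hourly` is illustrative, and the lowercase `'h'` alias is used because recent pandas versions deprecate `'H'`.

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2020-10-27 00:00:05', '2020-10-27 00:00:10', '2020-10-27 00:00:15',
        '2020-10-27 01:00:05', '2020-10-27 01:00:10', '2020-10-27 01:00:15',
    ] * 2),
    'hostname': ['server001'] * 6 + ['server002'] * 6,
    'sessions': [22, 25, 21, 30, 30, 35, 15, 10, 11, 19, 22, 18],
})

# Group by host first, then by hourly bins of the 'datetime' column.
hourly = (
    df.groupby(['hostname', pd.Grouper(key='datetime', freq='h')])['sessions']
      .mean()
      .round(0)
      .astype(int)
      .reset_index()
)
print(hourly)
```

With the sample data this should give one row per host and hour, which is the shape asked for in the question.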

Another approach is to split the datetime into `Date` and `Time` (hour) columns and then take the mean:

df['datetime'] =  pd.to_datetime(df['datetime'])
df['Date'] = [x.strftime('%Y-%m-%d') for x in df['datetime'].tolist()]
df['Time'] = ['%s:00' % x.strftime('%H') for x in df['datetime'].tolist()]
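# Average only the numeric 'sessions' column within each (Date, Time, hostname) group
df_1 = df.groupby(['Date', 'Time', 'hostname'])[['sessions']].mean()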


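Edit: see the second example by Serge de Gosson de Varnnes. It is exactly what I wanted.

I believe I have found a solution to my problem. My first mistake was not creating an index by hour. I believe Amit Kumar was talking about this, but at the time I did not quite understand what he meant. Serge de Gosson de Varnnes also set an index on the data in his example.

I will plug my data into Serge de Gosson de Varnnes's example, so that anyone who finds this can run the example right away and check the output: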

import pandas as pd

d ={'datetime' :['2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15','2020-10-27 00:00:05','2020-10-27 00:00:10','2020-10-27 00:00:15','2020-10-27 01:00:05','2020-10-27 01:00:10','2020-10-27 01:00:15'],
   'hostname':['server001','server001','server001','server001','server001','server001','server002','server002','server002','server002','server002','server002'],
   'sessions':[ 22,25,21 ,30,30,35,15,10, 11,19,22,18]}       
df = pd.DataFrame(data=d)
df['datetime'] =  pd.to_datetime(df['datetime'])
df = df.set_index(pd.DatetimeIndex(df['datetime']))

hour_index = df.index.hour

# groupby is a DataFrame method, so it must be called on df
df = df.groupby([hour_index, 'hostname'])['sessions'].mean().round(0).astype(int)

with pd.option_context(
        'display.max_rows',
         None,
         'display.max_columns',
         None
         ):
    print(df)
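The round and astype methods are applied here to round each mean to the nearest integer. This is not something I specified earlier, because I already knew how to handle it, but I am including it here for completeness.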
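One detail that may matter when comparing results by hand: as far as I know, `round` here follows NumPy's convention of rounding exact halves to the nearest even number, so a mean of exactly 2.5 becomes 2 rather than 3. A small sketch with made-up values (the series `means` is purely illustrative):

```python
import pandas as pd

# Made-up means, including exact halves, to show the rounding behaviour.
means = pd.Series([0.5, 1.5, 2.5, 22.67])

# round(0) rounds halves to the nearest even value; astype(int) then drops the ".0".
rounded = means.round(0).astype(int)
print(rounded.tolist())
```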

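The with statement here allows the complete dataframe to be printed (be careful with large dataframes, since that can put a lot of data on the screen at once).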

Output:

datetime  hostname 
0         server001    23
          server002    12
1         server001    32
          server002    20
The only improvement left here would be to show the hour index in clock format with a timestamp.
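One way to get that clock-style label might be to floor each timestamp to the hour instead of extracting the integer hour. This is a minimal sketch on a shortened version of the data, assuming the same column names; `hour_start` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2020-10-27 00:00:05', '2020-10-27 00:00:15',
        '2020-10-27 01:00:05', '2020-10-27 01:00:15',
    ]),
    'hostname': ['server001'] * 4,
    'sessions': [22, 25, 30, 35],
})

# Flooring keeps a full timestamp (date plus hour) rather than a bare 0..23 integer.
hour_start = df['datetime'].dt.floor('h').rename('hour_start')
result = df.groupby([hour_start, 'hostname'])['sessions'].mean()
print(result)
```

The first index level then holds values such as `2020-10-27 00:00:00` and `2020-10-27 01:00:00`.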


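One more issue is not solved here, but it is outside the scope of this particular question: what happens when the datetime column contains multiple days. I will deal with that by splitting my dataframe into one dataframe per day. If I find a better way to handle each day, I will add it to my solution.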
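As a possible alternative to splitting the dataframe per day, the calendar date could be added as an extra grouping key. The sketch below uses made-up two-day data to show the difference; the names `by_hour`, `date_key`, `hour_key` and `by_day_hour` are illustrative only.

```python
import pandas as pd

# Made-up data: the same hour (00:xx) on two different days.
df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2020-10-27 00:00:05', '2020-10-27 00:00:10',
        '2020-10-28 00:00:05', '2020-10-28 00:00:10',
    ]),
    'hostname': ['server001'] * 4,
    'sessions': [20, 30, 40, 60],
})

# Grouping on the hour alone merges both days into one bucket.
by_hour = df.groupby([df['datetime'].dt.hour, 'hostname'])['sessions'].mean()

# Adding the calendar date as an extra key keeps the days apart.
date_key = df['datetime'].dt.date.rename('date')
hour_key = df['datetime'].dt.hour.rename('hour')
by_day_hour = df.groupby([date_key, hour_key, 'hostname'])['sessions'].mean()

print(by_hour)
print(by_day_hour)
```

Here `by_hour` has a single row for hour 0, while `by_day_hour` has one row per day.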

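Why not extract the hour from the datetime column, create an hour column and a date column (instead of datetime), and then group by date, hour and hostname?

Unfortunately that is just the average over all sessions. What I need is to group by hostname and then average each host's own sessions.

Take another look, it may give you the solution.

For some reason I did not see the last part; I think something was wrong with the device I was viewing it on. This is exactly what I wanted! Thank you very much.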