Python: generating sessions from log file analysis with pandas (tags: python, pandas, timedelta, dataframe)

I am parsing an Apache log file and have imported it into a pandas DataFrame.

'65.55.52.118 - - [30/May/2013:06:58:52 -0600] "GET /detaileddven.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +)"'
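
For readers who want to reproduce the import step, here is a minimal sketch of loading such lines into a DataFrame. It assumes the log uses the Apache "combined" format; the regex, the `parse_log_lines` helper, and the column names `IP` and `Agent` are illustrative choices, not the asker's actual code, and the sample line is a simplified stand-in for the one above.

```python
import re
import pandas as pd

# Regex for the Apache "combined" log format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<IP>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<Agent>[^"]*)"'
)

def parse_log_lines(lines):
    """Return a DataFrame indexed by request time, one row per parsed line."""
    # Lines that do not match the pattern are silently skipped.
    rows = [m.groupdict() for m in map(LOG_PATTERN.match, lines) if m]
    df = pd.DataFrame(rows)
    # %z parses the "-0600" UTC offset; utc=True normalizes everything to UTC.
    df['time'] = pd.to_datetime(df['time'], format='%d/%b/%Y:%H:%M:%S %z', utc=True)
    return df.set_index('time').sort_index()

# Hypothetical sample line, simplified from the one quoted in the question.
sample_lines = [
    '65.55.52.118 - - [30/May/2013:06:58:52 -0600] '
    '"GET /detaileddven.php?refId=7954&uId=2802 HTTP/1.1" 200 4514 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
]
log_df = parse_log_lines(sample_lines)
```

Keeping the timestamps as real datetimes (rather than strings) is what makes the later time-difference comparison possible.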

My DataFrame:



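I want to group the rows into sessions based on IP, agent, and time difference: if the gap between consecutive requests is greater than 30 minutes, a new session should start.

Grouping the DataFrame by IP and agent is easy, but how do I check the time difference? I hope the question is clear.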

sessions = df.groupby(['IP', 'Agent']).size()
Update: df.index looks like this:

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-05-30 06:00:41, ..., 2013-05-30 22:29:14]
Length: 31975, Freq: None, Timezone: None

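I would use a shift and a cumsum. The IPython session further down shows a simple example that uses numbers instead of times, but timestamps work in exactly the same way.

* The need for skipna=False appears to be a bug.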

You can then apply this within each group. Now you can group by 'ip' and 'session_number' (and analyze each session).
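
As a rough illustration of the same shift-and-cumsum idea on real timestamps, the sketch below uses a 30-minute timeout and then summarizes each session. The sample data and the column names `IP`, `Agent`, and `time` are assumptions, and `diff()` is used here as shorthand for subtracting the shifted column; this is not the answerer's original code.

```python
import pandas as pd

# Hypothetical sample data; the column names 'IP', 'Agent' and 'time' are assumptions.
df = pd.DataFrame({
    'IP': ['1.1.1.1'] * 4 + ['2.2.2.2'] * 2,
    'Agent': ['bot'] * 4 + ['browser'] * 2,
    'time': pd.to_datetime([
        '2013-05-30 06:00:00', '2013-05-30 06:10:00',
        '2013-05-30 06:45:00', '2013-05-30 06:50:00',
        '2013-05-30 06:05:00', '2013-05-30 07:00:00',
    ]),
})

timeout = pd.Timedelta(minutes=30)

df = df.sort_values(['IP', 'Agent', 'time'])
# Gap to the previous request of the same visitor (NaT for the first request).
gap = df.groupby(['IP', 'Agent'])['time'].diff()
# A gap longer than the timeout starts a new session; NaT > timeout is False.
new_session = gap > timeout
# Running count of session starts within each visitor: 0, 0, 1, 1, ...
df['session_number'] = (
    new_session.astype(int).groupby([df['IP'], df['Agent']]).cumsum()
)

# One row per session: first/last request time and number of hits.
sessions = (
    df.groupby(['IP', 'Agent', 'session_number'])['time']
      .agg(start='min', end='max', hits='count')
)
sessions['duration'] = sessions['end'] - sessions['start']
```

Because the first request of each visitor has no predecessor, its gap is NaT and it never counts as a session break, so every visitor's numbering starts at 0.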

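Andy Hayden's answer is nice and concise, but it becomes very slow when you have a large number of users/IP addresses to group by. Here is another approach that is uglier but also much faster.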

import pandas as pd
import numpy as np

sample = lambda x: np.random.choice(x, size=10000)
df = pd.DataFrame({'ip': sample(range(500)), 
                   'time': sample([1., 1.1, 1.2, 2.7, 3.2, 3.8, 3.9])})
max_diff = 0.5 # Max time difference

def method_1(df):
    df = df.sort_values('time')
    # group_keys=False keeps the original index so the result aligns with df
    g = df.groupby('ip', group_keys=False)
    df['session'] = g['time'].apply(
        lambda s: (s - s.shift(1) > max_diff).fillna(0).cumsum(skipna=False)
        )
    return df['session']


def method_2(df):
    # Sort by ip then time 
    df = df.sort_values(['ip', 'time'])

    # Get locations where the ip changes 
    ip_change = df.ip != df.ip.shift()
    time_or_ip_change = (df.time - df.time.shift() > max_diff) | ip_change
    df['session'] = time_or_ip_change.cumsum()

    # The cumsum operated over the whole series, so subtract out the first 
    # value for each IP
    df['tmp'] = 0
    df.loc[ip_change, 'tmp'] = df.loc[ip_change, 'session']
    df['tmp'] = np.maximum.accumulate(df.tmp)
    df['session'] = df.session - df.tmp

    # Delete the temporary column
    del df['tmp']
    return df['session']

r1 = method_1(df)
r2 = method_2(df)

assert (r1.sort_index() == r2.sort_index()).all()

%timeit method_1(df)
%timeit method_2(df)

400 ms ± 195 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.6 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

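Thank you, Andy! It took a long time, but I got an answer :) But why am I getting this error? AttributeError: 'Timestamp' object has no attribute 'shift'

@NilaniAlgiriyage It looks like you are trying to apply shift to a Timestamp rather than to a column/Series (though I'm not sure how that happened).

df['tval'] = df.index; df['delta'] = (df['tval'] - df['tval'].shift(1) > 30).fillna(0).cumsum(skipna=False) Is the above code correct? It gives a different error, a TypeError.

The above code works for me... Your code looks fine, but I think you should use pd.offsets.Minute(30).nanos rather than 30. Can you confirm the result of type(df['tval'])? And your pandas version (this works on 0.11).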
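
To make the point of this comment thread concrete, here is a small sketch of comparing timestamp differences against a 30-minute threshold when the times live in a DatetimeIndex. The sample frame is hypothetical, and `pd.Timedelta(minutes=30)` is used as an alternative way of expressing the threshold in place of the `pd.offsets.Minute(30).nanos` suggested above.

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex, like the df.index shown in the question.
df = pd.DataFrame(
    {'hit': range(4)},
    index=pd.to_datetime([
        '2013-05-30 06:00:41', '2013-05-30 06:10:00',
        '2013-05-30 07:00:00', '2013-05-30 07:05:00',
    ]),
)

# Copy the index into a column so that Series methods such as shift() are available.
df['tval'] = df.index
# Difference to the previous row: a timedelta Series whose first value is NaT.
delta = df['tval'] - df['tval'].shift(1)

# Compare against a Timedelta rather than the bare number 30.
df['session'] = (delta > pd.Timedelta(minutes=30)).cumsum()
```

The comparison yields a boolean Series (NaT compares as False), so the cumulative sum increases by one at each gap longer than 30 minutes.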
In [21]: df = pd.DataFrame([[1.1, 1.7, 2.5, 2.6, 2.7, 3.4], list('AAABBB')]).T

In [22]: df.columns = ['time', 'ip']

In [23]: df
Out[23]:
  time ip
0  1.1  A
1  1.7  A
2  2.5  A
3  2.6  B
4  2.7  B
5  3.4  B

In [24]: g = df.groupby('ip')

In [25]: df['session_number'] = g['time'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))

In [26]: df
Out[26]:
  time ip  session_number
0  1.1  A               0
1  1.7  A               1
2  2.5  A               2
3  2.6  B               0
4  2.7  B               0
5  3.4  B               1