Grouping daily data by month in python/pandas and then normalizing


I have the following table in a pandas DataFrame:

    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron          364         2012-10-02 00:00:00
2   current         280         2012-10-02 00:00:00
3   molecular       259         2012-10-02 00:00:00
4   stem            201         2012-10-02 00:00:00
The table contains query volumes from server logs, by day. I would like to do two things:

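  • I want to group the queries by month, summing each query's volume over the whole month. For example, if "molecular" has a volume of 1000 on 2012-10-02 and 500 on 2012-10-03, the new table should contain a single entry with a volume of 1500, dated 2012-10-31 (the month-end date stands for the whole month; every date in the transformed table will be a month end).
  • I want to add a fifth column containing the normalized q_visits, i.e. a term's monthly query volume divided by the total query volume of all terms in that month.

What is the best way to do this?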

If I understand correctly:

For (1), do the following:

Make some fake data by sampling from the values you gave, some random dates, and some visit counts:

    In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
    
    In [180]: visits = Series(poisson(1000, size=100), name='visits')
    
    In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
    
    In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
    
    In [183]: df.head()
    Out[183]:
                     date   string  visits
    0 2001-11-15 00:00:00  current     997
    1 2001-11-15 00:00:00  current     974
    2 2012-10-02 00:00:00     stem     982
    3 2001-12-01 00:00:00     stem     984
    4 2001-01-01 00:00:00  current     989
    
    In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
    
    In [187]: resamp.head()
    Out[187]:
                        visits
    string  date
    current 2001-01-31    2996
            2001-02-28     NaN
            2001-03-31     NaN
            2001-04-30     NaN
            2001-05-31    3016
    
The NaNs are there because there were no visits with that query string in those months.
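
The session above relies on resample(..., how='sum'), a keyword that later pandas releases removed. As a minimal sketch of step (1) for a current pandas, assuming the answer's column names (string, visits, date) and a small hand-made frame rather than the real log data, one option is to roll each date forward to its month end and group on that:

```python
import pandas as pd

# Hand-made stand-in for the daily log table (illustrative values only).
df = pd.DataFrame({
    "string": ["molecular", "molecular", "neuron", "molecular"],
    "visits": [1000, 500, 364, 200],
    "date": pd.to_datetime(
        ["2012-10-02", "2012-10-03", "2012-10-02", "2012-11-05"]
    ),
})

# Drop any time-of-day part, then roll each date forward to the last day
# of its month (dates already on a month end stay where they are).
month_end = df["date"].dt.normalize() + pd.offsets.MonthEnd(0)

# Sum the visits for every (string, month end) pair that actually occurs.
monthly = (
    df.assign(date=month_end)
      .groupby(["string", "date"], as_index=False)["visits"]
      .sum()
)
print(monthly)
```

Because this groups only on (string, month) pairs that are present in the data, it produces no NaN rows for months without visits.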

For (2), group by date and then divide by the sum:

    In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
    
    In [189]: g.head()
    Out[189]:
                        visits
    string  date
    current 2001-01-31   0.177
            2001-02-28     NaN
            2001-03-31     NaN
            2001-04-30     NaN
            2001-05-31   0.188
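
An alternative to apply for step (2) is transform, which returns one value per input row and therefore keeps the original index and row order. Depending on the pandas version, groupby(...).apply may also add the group key to the resulting index, which transform avoids. The sketch below assumes a flat monthly table with illustrative values; the share column name is made up for this example:

```python
import pandas as pd

# Illustrative monthly totals, one row per (string, month end).
monthly = pd.DataFrame({
    "string": ["molecular", "neuron", "molecular", "neuron"],
    "date": pd.to_datetime(
        ["2012-10-31", "2012-10-31", "2012-11-30", "2012-11-30"]
    ),
    "visits": [1500, 500, 200, 600],
})

# Total visits of all strings in the same month, repeated on every row.
month_total = monthly.groupby("date")["visits"].transform("sum")

# Each string's share of its month's total.
monthly["share"] = monthly["visits"] / month_total
print(monthly)
```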
    
Just to convince you that (2) is doing what you want:

    In [176]: h = g.sortlevel('date').head()
    
    In [177]: h
    Out[177]:
                          visits
    string    date
    current   2001-01-31   0.077
    molecular 2001-01-31   0.228
    neuron    2001-01-31   0.073
    nucleus   2001-01-31   0.234
    stem      2001-01-31   0.388
    
    In [178]: h.sum()
    Out[178]:
    visits    1
    dtype: float64
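
Two caveats about the check above: sortlevel was removed in later pandas releases (sort_index(level=...) is its replacement), and head() only inspects the first five rows, which here happen to cover a single month. A sketch that verifies every month at once, using a tiny hand-made stand-in for g:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for g: normalized visits indexed by (string, date).
index = pd.MultiIndex.from_product(
    [["current", "stem"], pd.to_datetime(["2001-01-31", "2001-05-31"])],
    names=["string", "date"],
)
g = pd.DataFrame({"visits": [0.25, 0.5, 0.75, 0.5]}, index=index)

# Modern spelling of g.sortlevel('date'): sort rows by the date level.
by_date = g.sort_index(level="date")

# Sum the shares within each month; every month should come out as 1.
month_sums = g.groupby(level="date")["visits"].sum()
assert np.allclose(month_sums, 1.0)
```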
    
If you want to convert resamp to a DataFrame and drop the NaNs, do this:

    In [196]: resamp.dropna()
    Out[196]:
                          visits
    string    date
    current   2001-01-31    2996
              2001-05-31    3016
              2001-11-30    5959
              2001-12-31    3998
              2013-09-30    1077
    molecular 2001-01-31    3984
              2001-05-31    1911
              2001-11-30    3054
              2001-12-31    1020
              2012-10-31     977
              2013-09-30    1947
    neuron    2001-01-31    3961
              2001-05-31    2069
              2001-11-30    5010
              2001-12-31    2065
              2012-10-31    6973
              2013-09-30     994
    nucleus   2001-01-31    3060
              2001-05-31    3035
              2001-11-30    2924
              2001-12-31    4144
              2012-10-31    2004
              2013-09-30    7881
    stem      2001-01-31    2911
              2001-05-31    5994
              2001-11-30    6072
              2001-12-31    4916
              2012-10-31    1991
              2013-09-30    3977
    
    In [197]: resamp.dropna().reset_index()
    Out[197]:
           string                date  visits
    0     current 2001-01-31 00:00:00    2996
    1     current 2001-05-31 00:00:00    3016
    2     current 2001-11-30 00:00:00    5959
    3     current 2001-12-31 00:00:00    3998
    4     current 2013-09-30 00:00:00    1077
    5   molecular 2001-01-31 00:00:00    3984
    6   molecular 2001-05-31 00:00:00    1911
    7   molecular 2001-11-30 00:00:00    3054
    8   molecular 2001-12-31 00:00:00    1020
    9   molecular 2012-10-31 00:00:00     977
    10  molecular 2013-09-30 00:00:00    1947
    11     neuron 2001-01-31 00:00:00    3961
    12     neuron 2001-05-31 00:00:00    2069
    13     neuron 2001-11-30 00:00:00    5010
    14     neuron 2001-12-31 00:00:00    2065
    15     neuron 2012-10-31 00:00:00    6973
    16     neuron 2013-09-30 00:00:00     994
    17    nucleus 2001-01-31 00:00:00    3060
    18    nucleus 2001-05-31 00:00:00    3035
    19    nucleus 2001-11-30 00:00:00    2924
    20    nucleus 2001-12-31 00:00:00    4144
    21    nucleus 2012-10-31 00:00:00    2004
    22    nucleus 2013-09-30 00:00:00    7881
    23       stem 2001-01-31 00:00:00    2911
    24       stem 2001-05-31 00:00:00    5994
    25       stem 2001-11-30 00:00:00    6072
    26       stem 2001-12-31 00:00:00    4916
    27       stem 2012-10-31 00:00:00    1991
    28       stem 2013-09-30 00:00:00    3977
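
The same dropna().reset_index() chain on a tiny frame, as a sketch: reset_index turns the string and date index levels into ordinary columns. The two non-missing totals are copied from the output above, but the frame itself is an assumption made for illustration. Since a column that held NaN is stored as float, the sketch also casts the totals back to integers:

```python
import numpy as np
import pandas as pd

# Stand-in for resamp: months without visits show up as NaN.
index = pd.MultiIndex.from_product(
    [["current", "stem"], pd.to_datetime(["2001-01-31", "2001-02-28"])],
    names=["string", "date"],
)
resamp = pd.DataFrame({"visits": [2996.0, np.nan, 2911.0, np.nan]}, index=index)

# Drop the empty months, then move the index levels into columns.
flat = resamp.dropna().reset_index()

# With the NaNs gone, the totals can be stored as integers again.
flat["visits"] = flat["visits"].astype(int)
print(flat)
```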
    
Of course, you can do the same for g:

    In [198]: g.dropna()
    Out[198]:
                          visits
    string    date
    current   2001-01-31   0.177
              2001-05-31   0.188
              2001-11-30   0.259
              2001-12-31   0.248
              2013-09-30   0.068
    molecular 2001-01-31   0.236
              2001-05-31   0.119
              2001-11-30   0.133
              2001-12-31   0.063
              2012-10-31   0.082
              2013-09-30   0.123
    neuron    2001-01-31   0.234
              2001-05-31   0.129
              2001-11-30   0.218
              2001-12-31   0.128
              2012-10-31   0.584
              2013-09-30   0.063
    nucleus   2001-01-31   0.181
              2001-05-31   0.189
              2001-11-30   0.127
              2001-12-31   0.257
              2012-10-31   0.168
              2013-09-30   0.496
    stem      2001-01-31   0.172
              2001-05-31   0.374
              2001-11-30   0.264
              2001-12-31   0.305
              2012-10-31   0.167
              2013-09-30   0.251
    
    In [199]: g.dropna().reset_index()
    Out[199]:
           string                date  visits
    0     current 2001-01-31 00:00:00   0.177
    1     current 2001-05-31 00:00:00   0.188
    2     current 2001-11-30 00:00:00   0.259
    3     current 2001-12-31 00:00:00   0.248
    4     current 2013-09-30 00:00:00   0.068
    5   molecular 2001-01-31 00:00:00   0.236
    6   molecular 2001-05-31 00:00:00   0.119
    7   molecular 2001-11-30 00:00:00   0.133
    8   molecular 2001-12-31 00:00:00   0.063
    9   molecular 2012-10-31 00:00:00   0.082
    10  molecular 2013-09-30 00:00:00   0.123
    11     neuron 2001-01-31 00:00:00   0.234
    12     neuron 2001-05-31 00:00:00   0.129
    13     neuron 2001-11-30 00:00:00   0.218
    14     neuron 2001-12-31 00:00:00   0.128
    15     neuron 2012-10-31 00:00:00   0.584
    16     neuron 2013-09-30 00:00:00   0.063
    17    nucleus 2001-01-31 00:00:00   0.181
    18    nucleus 2001-05-31 00:00:00   0.189
    19    nucleus 2001-11-30 00:00:00   0.127
    20    nucleus 2001-12-31 00:00:00   0.257
    21    nucleus 2012-10-31 00:00:00   0.168
    22    nucleus 2013-09-30 00:00:00   0.496
    23       stem 2001-01-31 00:00:00   0.172
    24       stem 2001-05-31 00:00:00   0.374
    25       stem 2001-11-30 00:00:00   0.264
    26       stem 2001-12-31 00:00:00   0.305
    27       stem 2012-10-31 00:00:00   0.167
    28       stem 2013-09-30 00:00:00   0.251
    
Finally, if you want the columns in a different order, use reindex:

    In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
    Out[210]:
        visits     string                date
    0    0.177    current 2001-01-31 00:00:00
    1    0.188    current 2001-05-31 00:00:00
    2    0.259    current 2001-11-30 00:00:00
    3    0.248    current 2001-12-31 00:00:00
    4    0.068    current 2013-09-30 00:00:00
    5    0.236  molecular 2001-01-31 00:00:00
    6    0.119  molecular 2001-05-31 00:00:00
    7    0.133  molecular 2001-11-30 00:00:00
    8    0.063  molecular 2001-12-31 00:00:00
    9    0.082  molecular 2012-10-31 00:00:00
    10   0.123  molecular 2013-09-30 00:00:00
    11   0.234     neuron 2001-01-31 00:00:00
    12   0.129     neuron 2001-05-31 00:00:00
    13   0.218     neuron 2001-11-30 00:00:00
    14   0.128     neuron 2001-12-31 00:00:00
    15   0.584     neuron 2012-10-31 00:00:00
    16   0.063     neuron 2013-09-30 00:00:00
    17   0.181    nucleus 2001-01-31 00:00:00
    18   0.189    nucleus 2001-05-31 00:00:00
    19   0.127    nucleus 2001-11-30 00:00:00
    20   0.257    nucleus 2001-12-31 00:00:00
    21   0.168    nucleus 2012-10-31 00:00:00
    22   0.496    nucleus 2013-09-30 00:00:00
    23   0.172       stem 2001-01-31 00:00:00
    24   0.374       stem 2001-05-31 00:00:00
    25   0.264       stem 2001-11-30 00:00:00
    26   0.305       stem 2001-12-31 00:00:00
    27   0.167       stem 2012-10-31 00:00:00
    28   0.251       stem 2013-09-30 00:00:00
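
Selecting the columns with a list is another way to reorder them. The sketch below, on a small hand-made frame, puts the date first, the string second and the value third, and compares the result with reindex:

```python
import pandas as pd

# Small flat table in the layout produced by reset_index().
flat = pd.DataFrame({
    "string": ["current", "stem"],
    "date": pd.to_datetime(["2001-01-31", "2001-01-31"]),
    "visits": [2996, 2911],
})

# Reorder by selecting the columns in the desired order.
ordered = flat[["date", "string", "visits"]]

# reindex(columns=...) gives the same result for existing column names.
same = flat.reindex(columns=["date", "string", "visits"])
print(ordered)
```

One difference worth knowing: a misspelled name in the list selection raises a KeyError, whereas reindex silently creates a new column filled with NaN.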
    

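Phillip, this is great. For (1) and (2), is there a way to have the date in the first column, the query string in the second, and the monthly total volume in the third (so that there are no NaNs)?

Two things. First, h is a Series with a MultiIndex, so it doesn't have any "columns", but you can easily convert it to a DataFrame. I'll add that to my answer. Second, doing that won't remove the NaNs. To remove the NaNs, you can call dropna(). I'll put that in my answer.

I don't think this scales particularly well with the number of unique strings. It will be O(# unique strings * # months * # observations per month): for each unique string, the data is grouped by month (at most 12 unique months). Assuming each addition is O(1), that is the complexity.

Phillip, let me check a couple of things and I'll get back to you.

Thanks for your answer, Phillip. It taught me a lot. Is there a way to keep the absolute monthly totals per string in the final df?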
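
On that last question, one possible approach (not taken from the answer above) is to compute the normalized values as an extra column instead of replacing the totals. The sketch below uses the question's column names with partly made-up values; q_visits_norm is a hypothetical column name chosen for this example:

```python
import pandas as pd

# Daily rows in the question's layout (values partly made up).
daily = pd.DataFrame({
    "q_string": ["molecular", "molecular", "neuron", "neuron"],
    "q_visits": [1000, 500, 364, 136],
    "q_date": pd.to_datetime(
        ["2012-10-02", "2012-10-03", "2012-10-02", "2012-10-20"]
    ),
})

# Represent each month by its last day.
month_end = daily["q_date"].dt.normalize() + pd.offsets.MonthEnd(0)

# Absolute monthly total per query string.
monthly = (
    daily.assign(q_date=month_end)
         .groupby(["q_date", "q_string"], as_index=False)["q_visits"]
         .sum()
)

# Keep the totals and add each string's share of its month alongside them.
month_total = monthly.groupby("q_date")["q_visits"].transform("sum")
monthly["q_visits_norm"] = monthly["q_visits"] / month_total
print(monthly)
```

The result has the date first, the query string second, the absolute monthly total third and the normalized value last, with no NaN rows.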