在python/pandas中按月对每日数据进行分组，然后进行规范化_Python_Pandas

在python/pandas中按月对每日数据进行分组，然后进行规范化
python pandas
在python/pandas中按月对每日数据进行分组，然后进行规范化,python,pandas,Python,Pandas,我在熊猫数据框中有下表： q_string q_visits q_date 0 nucleus 1790 2012-10-02 00:00:00 1 neuron 364 2012-10-02 00:00:00 2 current 280 2012-10-02 00:00:00 3 molecular 259 2012-10-02 00:
我在熊猫
数据框中有下表：
    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron          364         2012-10-02 00:00:00
2   current         280         2012-10-02 00:00:00
3   molecular       259         2012-10-02 00:00:00
4   stem            201         2012-10-02 00:00:00

该表按天包含服务器日志中的查询卷。我想做两件事：
我想按月对查询进行分组，将整个月的查询量相加，例如，如果2012-10-02卷上有“分子”1000，2012-10-03卷上有“分子”500，那么在新表中应该有一个1500（卷）的条目，日期为2012-10-31（月末端点表示月份–转换表中的所有日期都将是月末，表示与之相关的整个月份）
我想添加第五列，其中包含月标准化q_访问量
。即，一个术语的每月查询量除以该月所有术语的总查询量
这样做的最佳方式是什么？
如果我理解正确：
对于（1）项，请执行以下操作：
通过从您给出的值、一些随机日期和访问次数中取样，制作一些虚假数据：
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')

In [180]: visits = Series(poisson(1000, size=100), name='date')

In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')

In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})

In [183]: df.head()
Out[183]:
                 date   string  visits
0 2001-11-15 00:00:00  current     997
1 2001-11-15 00:00:00  current     974
2 2012-10-02 00:00:00     stem     982
3 2001-12-01 00:00:00     stem     984
4 2001-01-01 00:00:00  current     989

In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')

In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016

NaN
之所以存在，是因为在这几个月中没有使用该查询字符串的访问
对于（2），按日期分组，然后除以总和：
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())

In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188

只是为了让你相信（2）正在做你想做的事：
In [176]: h = g.sortlevel('date').head()

In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388

In [178]: h.sum()
Out[178]:
visits    1
dtype: float64

如果要将resamp
转换为DataFrame
并删除NaN
s，请执行以下操作：
In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977

In [197]: resamp.dropna().reset_index()
Out[197]:
       string                date  visits
0     current 2001-01-31 00:00:00    2996
1     current 2001-05-31 00:00:00    3016
2     current 2001-11-30 00:00:00    5959
3     current 2001-12-31 00:00:00    3998
4     current 2013-09-30 00:00:00    1077
5   molecular 2001-01-31 00:00:00    3984
6   molecular 2001-05-31 00:00:00    1911
7   molecular 2001-11-30 00:00:00    3054
8   molecular 2001-12-31 00:00:00    1020
9   molecular 2012-10-31 00:00:00     977
10  molecular 2013-09-30 00:00:00    1947
11     neuron 2001-01-31 00:00:00    3961
12     neuron 2001-05-31 00:00:00    2069
13     neuron 2001-11-30 00:00:00    5010
14     neuron 2001-12-31 00:00:00    2065
15     neuron 2012-10-31 00:00:00    6973
16     neuron 2013-09-30 00:00:00     994
17    nucleus 2001-01-31 00:00:00    3060
18    nucleus 2001-05-31 00:00:00    3035
19    nucleus 2001-11-30 00:00:00    2924
20    nucleus 2001-12-31 00:00:00    4144
21    nucleus 2012-10-31 00:00:00    2004
22    nucleus 2013-09-30 00:00:00    7881
23       stem 2001-01-31 00:00:00    2911
24       stem 2001-05-31 00:00:00    5994
25       stem 2001-11-30 00:00:00    6072
26       stem 2001-12-31 00:00:00    4916
27       stem 2012-10-31 00:00:00    1991
28       stem 2013-09-30 00:00:00    3977

当然，您也可以为g
执行此操作：
In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251

In [199]: g.dropna().reset_index()
Out[199]:
       string                date  visits
0     current 2001-01-31 00:00:00   0.177
1     current 2001-05-31 00:00:00   0.188
2     current 2001-11-30 00:00:00   0.259
3     current 2001-12-31 00:00:00   0.248
4     current 2013-09-30 00:00:00   0.068
5   molecular 2001-01-31 00:00:00   0.236
6   molecular 2001-05-31 00:00:00   0.119
7   molecular 2001-11-30 00:00:00   0.133
8   molecular 2001-12-31 00:00:00   0.063
9   molecular 2012-10-31 00:00:00   0.082
10  molecular 2013-09-30 00:00:00   0.123
11     neuron 2001-01-31 00:00:00   0.234
12     neuron 2001-05-31 00:00:00   0.129
13     neuron 2001-11-30 00:00:00   0.218
14     neuron 2001-12-31 00:00:00   0.128
15     neuron 2012-10-31 00:00:00   0.584
16     neuron 2013-09-30 00:00:00   0.063
17    nucleus 2001-01-31 00:00:00   0.181
18    nucleus 2001-05-31 00:00:00   0.189
19    nucleus 2001-11-30 00:00:00   0.127
20    nucleus 2001-12-31 00:00:00   0.257
21    nucleus 2012-10-31 00:00:00   0.168
22    nucleus 2013-09-30 00:00:00   0.496
23       stem 2001-01-31 00:00:00   0.172
24       stem 2001-05-31 00:00:00   0.374
25       stem 2001-11-30 00:00:00   0.264
26       stem 2001-12-31 00:00:00   0.305
27       stem 2012-10-31 00:00:00   0.167
28       stem 2013-09-30 00:00:00   0.251

最后，如果要将列按不同的顺序排列，请使用reindex
：
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
    visits     string                date
0    0.177    current 2001-01-31 00:00:00
1    0.188    current 2001-05-31 00:00:00
2    0.259    current 2001-11-30 00:00:00
3    0.248    current 2001-12-31 00:00:00
4    0.068    current 2013-09-30 00:00:00
5    0.236  molecular 2001-01-31 00:00:00
6    0.119  molecular 2001-05-31 00:00:00
7    0.133  molecular 2001-11-30 00:00:00
8    0.063  molecular 2001-12-31 00:00:00
9    0.082  molecular 2012-10-31 00:00:00
10   0.123  molecular 2013-09-30 00:00:00
11   0.234     neuron 2001-01-31 00:00:00
12   0.129     neuron 2001-05-31 00:00:00
13   0.218     neuron 2001-11-30 00:00:00
14   0.128     neuron 2001-12-31 00:00:00
15   0.584     neuron 2012-10-31 00:00:00
16   0.063     neuron 2013-09-30 00:00:00
17   0.181    nucleus 2001-01-31 00:00:00
18   0.189    nucleus 2001-05-31 00:00:00
19   0.127    nucleus 2001-11-30 00:00:00
20   0.257    nucleus 2001-12-31 00:00:00
21   0.168    nucleus 2012-10-31 00:00:00
22   0.496    nucleus 2013-09-30 00:00:00
23   0.172       stem 2001-01-31 00:00:00
24   0.374       stem 2001-05-31 00:00:00
25   0.264       stem 2001-11-30 00:00:00
26   0.305       stem 2001-12-31 00:00:00
27   0.167       stem 2012-10-31 00:00:00
28   0.251       stem 2013-09-30 00:00:00

Phillip，这很好-对于（1）和（2）来说，第一列中有日期，第二列中有查询字符串，第三列中有月总量（这样就没有NaN）？两件事：第一，h
是一个带有多索引的系列
，所以没有任何“列”，但您可以轻松地将其转换为数据帧
。我会将其添加到我的答案中。第二，这样做不会删除NaN
s。要删除NaN
s，您可以调用dropna（）
。我会将其放在我的答案中。我认为这与唯一字符串的数量不太相称。它将是O（#个唯一字符串*#个月*每月观察次数）
。对于每个唯一字符串，按个月数
（最多12个唯一月）分组。假设每个加法操作为O（1）
，这就是复杂性。Phillip，再检查两件事，然后再回来。谢谢你的回答Phillip-教了我很多。有没有办法根据字符串保持最终df中的绝对月总量？