Grouping daily data by month in python/pandas and then normalizing
I have the following table in a pandas DataFrame:
q_string q_visits q_date
0 nucleus 1790 2012-10-02 00:00:00
1 neuron 364 2012-10-02 00:00:00
2 current 280 2012-10-02 00:00:00
3 molecular 259 2012-10-02 00:00:00
4 stem 201 2012-10-02 00:00:00
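The table contains query volumes from server logs, by day. I would like to do two things:
(1) Group the queries by month, summing each query's volume over the whole month. For example, if "molecular" had a volume of 1000 on 2012-10-02 and 500 on 2012-10-03, the new table should have a single entry of 1500 (volume) dated 2012-10-31. The end-of-month date stands for the month: every date in the transformed table will be a month end representing the whole month it belongs to.
(2) Add a fifth column containing the month-normalized q_visits, i.e. a term's monthly query volume divided by the total query volume of all terms in that month.
What's the best way to go about this?
If I understand you correctly: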
For item (1), do the following:
Make some fake data by sampling from the values you gave, plus some random dates and visit counts:
In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
In [180]: visits = Series(poisson(1000, size=100), name='visits')
In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
The NaNs are there because there were no visits with that query string in those months.
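The resample('M', how='sum') call above reflects an older pandas API; newer releases no longer accept the how argument. As a rough sketch of the same monthly aggregation for a recent pandas, the snippet below swaps in a different technique: it groups on a computed month-end date instead of resampling, so months without visits simply do not appear and there are no NaNs to drop. The data is made up for illustration, and the column names (string, visits, date) follow the answer's fake data, not the q_-prefixed names in the question.

```python
import pandas as pd

# Made-up daily rows for illustration; not taken from the original log.
df = pd.DataFrame({
    "string": ["molecular", "molecular", "neuron", "molecular"],
    "visits": [1000, 500, 300, 200],
    "date": pd.to_datetime(
        ["2012-10-02", "2012-10-03", "2012-10-02", "2012-12-05"]
    ),
})

# Label every row with the last day of its month.
# MonthEnd(0) leaves dates that are already a month end unchanged.
month_end = df["date"].dt.normalize() + pd.offsets.MonthEnd(0)

# Sum visits per (string, month end); only months with data appear.
monthly = (
    df.groupby(["string", month_end])["visits"]
      .sum()
      .reset_index()
)
print(monthly)
```

With this input, the two October rows for "molecular" collapse into one row dated 2012-10-31.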
For (2), group by date and then divide by the sum:
In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 0.188
Just to convince you that (2) is doing what you want:
In [176]: h = g.sortlevel('date').head()
In [177]: h
Out[177]:
visits
string date
current 2001-01-31 0.077
molecular 2001-01-31 0.228
neuron 2001-01-31 0.073
nucleus 2001-01-31 0.234
stem 2001-01-31 0.388
In [178]: h.sum()
Out[178]:
visits 1
dtype: float64
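The normalization in (2) can also be written with transform, which broadcasts each month's total back onto every row so the division is a plain column operation. This is only a sketch on hypothetical monthly totals; the share column name is an assumption introduced here, not something from the question.

```python
import pandas as pd

# Hypothetical monthly totals, one row per (string, month end).
monthly = pd.DataFrame({
    "string": ["molecular", "neuron", "molecular", "neuron"],
    "date": pd.to_datetime(
        ["2012-10-31", "2012-10-31", "2012-11-30", "2012-11-30"]
    ),
    "visits": [1500, 500, 300, 100],
})

# Total over all strings in each month, aligned row by row with `monthly`.
month_total = monthly.groupby("date")["visits"].transform("sum")

# A term's monthly volume divided by that month's total volume.
monthly["share"] = monthly["visits"] / month_total

# Sanity check: the shares within each month add up to 1.
check = monthly.groupby("date")["share"].sum()
print(monthly)
print(check)
```

Because the absolute visits column is left in place, the result carries both the monthly total and its normalized value.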
If you want to convert resamp to a DataFrame and drop the NaNs, do the following:
In [196]: resamp.dropna()
Out[196]:
visits
string date
current 2001-01-31 2996
2001-05-31 3016
2001-11-30 5959
2001-12-31 3998
2013-09-30 1077
molecular 2001-01-31 3984
2001-05-31 1911
2001-11-30 3054
2001-12-31 1020
2012-10-31 977
2013-09-30 1947
neuron 2001-01-31 3961
2001-05-31 2069
2001-11-30 5010
2001-12-31 2065
2012-10-31 6973
2013-09-30 994
nucleus 2001-01-31 3060
2001-05-31 3035
2001-11-30 2924
2001-12-31 4144
2012-10-31 2004
2013-09-30 7881
stem 2001-01-31 2911
2001-05-31 5994
2001-11-30 6072
2001-12-31 4916
2012-10-31 1991
2013-09-30 3977
In [197]: resamp.dropna().reset_index()
Out[197]:
string date visits
0 current 2001-01-31 00:00:00 2996
1 current 2001-05-31 00:00:00 3016
2 current 2001-11-30 00:00:00 5959
3 current 2001-12-31 00:00:00 3998
4 current 2013-09-30 00:00:00 1077
5 molecular 2001-01-31 00:00:00 3984
6 molecular 2001-05-31 00:00:00 1911
7 molecular 2001-11-30 00:00:00 3054
8 molecular 2001-12-31 00:00:00 1020
9 molecular 2012-10-31 00:00:00 977
10 molecular 2013-09-30 00:00:00 1947
11 neuron 2001-01-31 00:00:00 3961
12 neuron 2001-05-31 00:00:00 2069
13 neuron 2001-11-30 00:00:00 5010
14 neuron 2001-12-31 00:00:00 2065
15 neuron 2012-10-31 00:00:00 6973
16 neuron 2013-09-30 00:00:00 994
17 nucleus 2001-01-31 00:00:00 3060
18 nucleus 2001-05-31 00:00:00 3035
19 nucleus 2001-11-30 00:00:00 2924
20 nucleus 2001-12-31 00:00:00 4144
21 nucleus 2012-10-31 00:00:00 2004
22 nucleus 2013-09-30 00:00:00 7881
23 stem 2001-01-31 00:00:00 2911
24 stem 2001-05-31 00:00:00 5994
25 stem 2001-11-30 00:00:00 6072
26 stem 2001-12-31 00:00:00 4916
27 stem 2012-10-31 00:00:00 1991
28 stem 2013-09-30 00:00:00 3977
Of course, you can do the same for g:
In [198]: g.dropna()
Out[198]:
visits
string date
current 2001-01-31 0.177
2001-05-31 0.188
2001-11-30 0.259
2001-12-31 0.248
2013-09-30 0.068
molecular 2001-01-31 0.236
2001-05-31 0.119
2001-11-30 0.133
2001-12-31 0.063
2012-10-31 0.082
2013-09-30 0.123
neuron 2001-01-31 0.234
2001-05-31 0.129
2001-11-30 0.218
2001-12-31 0.128
2012-10-31 0.584
2013-09-30 0.063
nucleus 2001-01-31 0.181
2001-05-31 0.189
2001-11-30 0.127
2001-12-31 0.257
2012-10-31 0.168
2013-09-30 0.496
stem 2001-01-31 0.172
2001-05-31 0.374
2001-11-30 0.264
2001-12-31 0.305
2012-10-31 0.167
2013-09-30 0.251
In [199]: g.dropna().reset_index()
Out[199]:
string date visits
0 current 2001-01-31 00:00:00 0.177
1 current 2001-05-31 00:00:00 0.188
2 current 2001-11-30 00:00:00 0.259
3 current 2001-12-31 00:00:00 0.248
4 current 2013-09-30 00:00:00 0.068
5 molecular 2001-01-31 00:00:00 0.236
6 molecular 2001-05-31 00:00:00 0.119
7 molecular 2001-11-30 00:00:00 0.133
8 molecular 2001-12-31 00:00:00 0.063
9 molecular 2012-10-31 00:00:00 0.082
10 molecular 2013-09-30 00:00:00 0.123
11 neuron 2001-01-31 00:00:00 0.234
12 neuron 2001-05-31 00:00:00 0.129
13 neuron 2001-11-30 00:00:00 0.218
14 neuron 2001-12-31 00:00:00 0.128
15 neuron 2012-10-31 00:00:00 0.584
16 neuron 2013-09-30 00:00:00 0.063
17 nucleus 2001-01-31 00:00:00 0.181
18 nucleus 2001-05-31 00:00:00 0.189
19 nucleus 2001-11-30 00:00:00 0.127
20 nucleus 2001-12-31 00:00:00 0.257
21 nucleus 2012-10-31 00:00:00 0.168
22 nucleus 2013-09-30 00:00:00 0.496
23 stem 2001-01-31 00:00:00 0.172
24 stem 2001-05-31 00:00:00 0.374
25 stem 2001-11-30 00:00:00 0.264
26 stem 2001-12-31 00:00:00 0.305
27 stem 2012-10-31 00:00:00 0.167
28 stem 2013-09-30 00:00:00 0.251
Finally, if you want to put the columns in a different order, use reindex:
In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
visits string date
0 0.177 current 2001-01-31 00:00:00
1 0.188 current 2001-05-31 00:00:00
2 0.259 current 2001-11-30 00:00:00
3 0.248 current 2001-12-31 00:00:00
4 0.068 current 2013-09-30 00:00:00
5 0.236 molecular 2001-01-31 00:00:00
6 0.119 molecular 2001-05-31 00:00:00
7 0.133 molecular 2001-11-30 00:00:00
8 0.063 molecular 2001-12-31 00:00:00
9 0.082 molecular 2012-10-31 00:00:00
10 0.123 molecular 2013-09-30 00:00:00
11 0.234 neuron 2001-01-31 00:00:00
12 0.129 neuron 2001-05-31 00:00:00
13 0.218 neuron 2001-11-30 00:00:00
14 0.128 neuron 2001-12-31 00:00:00
15 0.584 neuron 2012-10-31 00:00:00
16 0.063 neuron 2013-09-30 00:00:00
17 0.181 nucleus 2001-01-31 00:00:00
18 0.189 nucleus 2001-05-31 00:00:00
19 0.127 nucleus 2001-11-30 00:00:00
20 0.257 nucleus 2001-12-31 00:00:00
21 0.168 nucleus 2012-10-31 00:00:00
22 0.496 nucleus 2013-09-30 00:00:00
23 0.172 stem 2001-01-31 00:00:00
24 0.374 stem 2001-05-31 00:00:00
25 0.264 stem 2001-11-30 00:00:00
26 0.305 stem 2001-12-31 00:00:00
27 0.167 stem 2012-10-31 00:00:00
28 0.251 stem 2013-09-30 00:00:00
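As a side note on the reindex step, plain column selection gives the same ordering. The small sketch below, on made-up values, shows both and the one difference worth knowing: a misspelled name passed to reindex produces an all-NaN column, whereas selection raises a KeyError.

```python
import pandas as pd

# Made-up stand-in for g.dropna().reset_index().
out = pd.DataFrame({
    "string": ["current", "stem"],
    "date": pd.to_datetime(["2001-01-31", "2001-01-31"]),
    "visits": [0.25, 0.75],
})

# Both calls return the same columns in the requested order.
by_reindex = out.reindex(columns=["visits", "string", "date"])
by_select = out[["visits", "string", "date"]]

# With a misspelled name, reindex silently adds an all-NaN column;
# out[["visits", "strng"]] would raise KeyError instead.
typo = out.reindex(columns=["visits", "strng"])
print(by_reindex)
```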
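Phillip, this is great. For (1) and (2), is it possible to get a DataFrame with the date in the first column, the query string in the second column, and the monthly volume in the third (so that there are no NaNs)?
Two things: first, h is a Series with a MultiIndex, so there aren't any "columns", but you can easily convert it to a DataFrame. I'll add that to my answer. Second, doing this won't remove the NaNs. To remove the NaNs, you can call dropna(). I'll put that in my answer.
I don't think this scales too well with the number of unique strings. It will be O(# of unique strings * # of months * # of observations per month). For each unique string, you group by the number of months (at most 12 unique months). Assuming each addition is O(1), that's the complexity.
Phillip, I'm going to check a couple of things and get back to you.
Thanks for your answer, Phillip. It taught me a lot. Is there a way to keep the absolute monthly totals, by string, in the final df?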