Python: viewing log event categories, grouped by day


I'm using pandas to process some logs. I've basically reduced them to the following time series:

time
2014-03-18 17:00:25.266462           rt/top_rt
2014-03-18 17:00:25.722639    follow/retweeted
2014-03-18 17:00:26.773057           rt/top_rt
2014-03-18 17:00:28.077047           rt/top_rt
2014-03-18 17:00:28.904139           rt/top_rt
2014-03-18 17:00:29.512671           rt/top_rt
2014-03-18 17:00:29.640878    follow/retweeted
2014-03-18 21:00:30.087161           rt/top_rt
2014-03-18 21:00:30.272342    follow/retweeted
2014-03-18 21:00:31.284734           rt/top_rt
2014-03-18 21:00:31.467828    follow/retweeted
2014-03-18 21:00:33.955612           rt/top_rt
2014-03-18 21:00:35.810813           rt/top_rt
2014-03-18 21:00:37.710910           rt/top_rt
2014-03-18 21:00:38.200717           rt/top_rt
...
I'd like to count the log categories, grouped by day. So what I'm after is something like:

day           rt/top_rt   follow/retweeted  ...
2014-03-18    35          45
2014-03-19    67          90
...
There are several options (you could use df.pivot, df.pivot_table, df.groupby, or df.unstack), but crosstab seems the simplest here, as it counts frequencies by default.

Assuming you have a DataFrame df with a datetime index and a column log, you can do the following:

pd.crosstab(index=df.index.date, columns=df['log'])
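
For comparison, a roughly equivalent version using groupby and unstack (one of the other options mentioned above) could look like the sketch below; it assumes the same df with a datetime index and a 'log' column:

# Count events per (day, category), then pivot the categories into columns;
# fill_value=0 puts a zero where a category did not occur on a given day.
df.groupby([df.index.date, 'log']).size().unstack(fill_value=0)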
A concrete example:

In [230]: s = """2014-03-18 17:00:25.266462,           rt/top_rt
     ...: 2014-03-18 17:00:25.722639,    follow/retweeted
     ...: 2014-03-18 17:00:26.773057,           rt/top_rt
     ...: 2014-03-18 17:00:28.077047,           rt/top_rt
     ...: 2014-03-18 17:00:28.904139,           rt/top_rt
     ...: 2014-03-18 17:00:29.512671,           rt/top_rt
     ...: 2014-03-18 17:00:29.640878,    follow/retweeted
     ...: 2014-03-18 21:00:30.087161,           rt/top_rt
     ...: 2014-03-18 21:00:30.272342,    follow/retweeted
     ...: 2014-03-18 21:00:31.284734,           rt/top_rt
     ...: 2014-03-18 21:00:31.467828,    follow/retweeted
     ...: 2014-03-19 21:00:33.955612,           rt/top_rt
     ...: 2014-03-19 21:00:35.810813,           rt/top_rt
     ...: 2014-03-19 21:00:37.710910,           rt/top_rt
     ...: 2014-03-19 21:00:38.200717,           rt/top_rt"""

In [231]: df = pd.read_csv(StringIO(s), sep=",", header=None, index_col=0, names=['time', 'log'], 
     ...:                  skipinitialspace=True, parse_dates=True)

In [232]: df
Out[232]: 
                                         log
time                                        
2014-03-18 17:00:25.266462         rt/top_rt
2014-03-18 17:00:25.722639  follow/retweeted
2014-03-18 17:00:26.773057         rt/top_rt
2014-03-18 17:00:28.077047         rt/top_rt
2014-03-18 17:00:28.904139         rt/top_rt
2014-03-18 17:00:29.512671         rt/top_rt
2014-03-18 17:00:29.640878  follow/retweeted
2014-03-18 21:00:30.087161         rt/top_rt
2014-03-18 21:00:30.272342  follow/retweeted
2014-03-18 21:00:31.284734         rt/top_rt
2014-03-18 21:00:31.467828  follow/retweeted
2014-03-19 21:00:33.955612         rt/top_rt
2014-03-19 21:00:35.810813         rt/top_rt
2014-03-19 21:00:37.710910         rt/top_rt
2014-03-19 21:00:38.200717         rt/top_rt

In [233]: pd.crosstab(df.index.date, df['log'])
Out[233]: 
log         follow/retweeted  rt/top_rt
row_0                                  
2014-03-18                 4          7
2014-03-19                 0          4

Have you looked at df.pivot / df.pivot_table / pd.crosstab?

This works, thanks! The end result has an index called "row_0", though. What is that about?

Oh, that's just the "default" name for the index column. Got it.
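
If the default "row_0" label gets in the way, crosstab also takes a rownames argument to label the row axis yourself; a small sketch reusing the same df and log column from the example above:

# Name the row index "day" instead of the default "row_0".
pd.crosstab(df.index.date, df['log'], rownames=['day'])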
In [230]: s = """2014-03-18 17:00:25.266462,           rt/top_rt
     ...: 2014-03-18 17:00:25.722639,    follow/retweeted
     ...: 2014-03-18 17:00:26.773057,           rt/top_rt
     ...: 2014-03-18 17:00:28.077047,           rt/top_rt
     ...: 2014-03-18 17:00:28.904139,           rt/top_rt
     ...: 2014-03-18 17:00:29.512671,           rt/top_rt
     ...: 2014-03-18 17:00:29.640878,    follow/retweeted
     ...: 2014-03-18 21:00:30.087161,           rt/top_rt
     ...: 2014-03-18 21:00:30.272342,    follow/retweeted
     ...: 2014-03-18 21:00:31.284734,           rt/top_rt
     ...: 2014-03-18 21:00:31.467828,    follow/retweeted
     ...: 2014-03-19 21:00:33.955612,           rt/top_rt
     ...: 2014-03-19 21:00:35.810813,           rt/top_rt
     ...: 2014-03-19 21:00:37.710910,           rt/top_rt
     ...: 2014-03-19 21:00:38.200717,           rt/top_rt"""

In [231]: df = pd.read_csv(StringIO(s), sep=",", header=None, index_col=0, names=['time', 'log'], 
     ...:                  skipinitialspace=True, parse_dates=True)

In [232]: df
Out[232]: 
                                         log
time                                        
2014-03-18 17:00:25.266462         rt/top_rt
2014-03-18 17:00:25.722639  follow/retweeted
2014-03-18 17:00:26.773057         rt/top_rt
2014-03-18 17:00:28.077047         rt/top_rt
2014-03-18 17:00:28.904139         rt/top_rt
2014-03-18 17:00:29.512671         rt/top_rt
2014-03-18 17:00:29.640878  follow/retweeted
2014-03-18 21:00:30.087161         rt/top_rt
2014-03-18 21:00:30.272342  follow/retweeted
2014-03-18 21:00:31.284734         rt/top_rt
2014-03-18 21:00:31.467828  follow/retweeted
2014-03-19 21:00:33.955612         rt/top_rt
2014-03-19 21:00:35.810813         rt/top_rt
2014-03-19 21:00:37.710910         rt/top_rt
2014-03-19 21:00:38.200717         rt/top_rt

In [233]: pd.crosstab(df.index.date, df['log'])
Out[233]: 
log         follow/retweeted  rt/top_rt
row_0                                  
2014-03-18                 4          7
2014-03-19                 0          4