Python: viewing log event categories grouped by day
I'm using pandas to process some logs. I've basically reduced them to the following time series:
time
2014-03-18 17:00:25.266462 rt/top_rt
2014-03-18 17:00:25.722639 follow/retweeted
2014-03-18 17:00:26.773057 rt/top_rt
2014-03-18 17:00:28.077047 rt/top_rt
2014-03-18 17:00:28.904139 rt/top_rt
2014-03-18 17:00:29.512671 rt/top_rt
2014-03-18 17:00:29.640878 follow/retweeted
2014-03-18 21:00:30.087161 rt/top_rt
2014-03-18 21:00:30.272342 follow/retweeted
2014-03-18 21:00:31.284734 rt/top_rt
2014-03-18 21:00:31.467828 follow/retweeted
2014-03-18 21:00:33.955612 rt/top_rt
2014-03-18 21:00:35.810813 rt/top_rt
2014-03-18 21:00:37.710910 rt/top_rt
2014-03-18 21:00:38.200717 rt/top_rt
...
I'd like to focus on the log categories and group them by day. So I want something like this:
day rt/top_rt follow/retweeted ...
2014-03-18 35 45
2014-03-19 67 90
2014-03-19 67 90
...
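There are a few options (you could use df.pivot, df.pivot_table, df.groupby, or df.unstack), but pd.crosstab seems the most straightforward, since by default it computes a frequency table:
Assuming you have a DataFrame df with a datetime index and a column named log, you can do the following: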
pd.crosstab(index=df.index.date, columns=df['log'])
A concrete example (assuming `import pandas as pd` and `from io import StringIO` have already been run):
In [230]: s = """2014-03-18 17:00:25.266462, rt/top_rt
...: 2014-03-18 17:00:25.722639, follow/retweeted
...: 2014-03-18 17:00:26.773057, rt/top_rt
...: 2014-03-18 17:00:28.077047, rt/top_rt
...: 2014-03-18 17:00:28.904139, rt/top_rt
...: 2014-03-18 17:00:29.512671, rt/top_rt
...: 2014-03-18 17:00:29.640878, follow/retweeted
...: 2014-03-18 21:00:30.087161, rt/top_rt
...: 2014-03-18 21:00:30.272342, follow/retweeted
...: 2014-03-18 21:00:31.284734, rt/top_rt
...: 2014-03-18 21:00:31.467828, follow/retweeted
...: 2014-03-19 21:00:33.955612, rt/top_rt
...: 2014-03-19 21:00:35.810813, rt/top_rt
...: 2014-03-19 21:00:37.710910, rt/top_rt
...: 2014-03-19 21:00:38.200717, rt/top_rt"""
In [231]: df = pd.read_csv(StringIO(s), sep=",", header=None, index_col=0, names=['time', 'log'],
...: skipinitialspace=True, parse_dates=True)
In [232]: df
Out[232]:
log
time
2014-03-18 17:00:25.266462 rt/top_rt
2014-03-18 17:00:25.722639 follow/retweeted
2014-03-18 17:00:26.773057 rt/top_rt
2014-03-18 17:00:28.077047 rt/top_rt
2014-03-18 17:00:28.904139 rt/top_rt
2014-03-18 17:00:29.512671 rt/top_rt
2014-03-18 17:00:29.640878 follow/retweeted
2014-03-18 21:00:30.087161 rt/top_rt
2014-03-18 21:00:30.272342 follow/retweeted
2014-03-18 21:00:31.284734 rt/top_rt
2014-03-18 21:00:31.467828 follow/retweeted
2014-03-19 21:00:33.955612 rt/top_rt
2014-03-19 21:00:35.810813 rt/top_rt
2014-03-19 21:00:37.710910 rt/top_rt
2014-03-19 21:00:38.200717 rt/top_rt
In [233]: pd.crosstab(df.index.date, df['log'])
Out[233]:
log follow/retweeted rt/top_rt
row_0
2014-03-18 4 7
2014-03-19 0 4
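For readers who want to run this outside IPython, here is a minimal standalone sketch of the same idea. It assumes a recent pandas in which `pd.crosstab` takes `index` and `columns`, and it uses a shortened sample of the data above; the variable names `raw`, `counts`, and `alt` are illustrative only. The second table shows the `groupby` plus `unstack` route mentioned among the alternatives.

```python
from io import StringIO

import pandas as pd

# Shortened sample of the log data shown above (illustrative subset).
raw = """2014-03-18 17:00:25.266462, rt/top_rt
2014-03-18 17:00:25.722639, follow/retweeted
2014-03-18 21:00:30.087161, rt/top_rt
2014-03-19 21:00:33.955612, rt/top_rt
2014-03-19 21:00:35.810813, rt/top_rt"""

# Parse the first column as a datetime index; the second column is the category.
df = pd.read_csv(StringIO(raw), header=None, names=["time", "log"],
                 index_col=0, skipinitialspace=True, parse_dates=True)

# Frequency table: one row per calendar day, one column per log category.
counts = pd.crosstab(index=df.index.date, columns=df["log"])

# Alternative route: count each (day, category) pair, then pivot the
# category level into columns, filling absent combinations with 0.
alt = df.groupby([df.index.date, "log"]).size().unstack(fill_value=0)

print(counts)
print(alt)
```

Both tables should hold the same counts; `crosstab` is simply the shorter spelling when all you need is frequencies.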
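Have you looked at df.pivot / df.pivot_table / pd.crosstab?
This works, thanks! The final result has something called "row_0". What is that? Oh, it's the "default" name for the index column. Got it.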
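On the "row_0" question: that label appears because the array passed as the index has no name of its own. A minimal sketch of one way to replace it, using the `rownames` parameter of `pd.crosstab`; the tiny frame and the name "day" are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frame with a datetime index and a 'log' column.
df = pd.DataFrame(
    {"log": ["rt/top_rt", "follow/retweeted", "rt/top_rt"]},
    index=pd.to_datetime([
        "2014-03-18 17:00:25",
        "2014-03-18 17:00:26",
        "2014-03-19 21:00:33",
    ]),
)

# rownames labels the row index, so it reads 'day' instead of 'row_0'.
table = pd.crosstab(index=df.index.date, columns=df["log"], rownames=["day"])

print(table.index.name)
print(table)
```

Setting `table.index.name = "day"` after the fact would work as well; `rownames` just does it in the same call.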