Elegant groupby and update in pandas?

I have the following pandas.DataFrame object:
offset ts op time
0 0.000000 2015-10-27 18:31:40.318 Decompress 2.953
1 0.000000 2015-10-27 18:31:40.318 DeserializeBond 0.015
32 0.000000 2015-10-27 18:31:40.318 Compress 17.135
33 0.000000 2015-10-27 18:31:40.318 BuildIndex 19.494
34 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.625
35 0.000000 2015-10-27 18:31:40.318 Compress 16.970
36 0.000000 2015-10-27 18:31:40.318 BuildIndex 18.954
37 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
38 0.000000 2015-10-27 18:31:40.318 Compress 16.017
39 0.000000 2015-10-27 18:31:40.318 BuildIndex 17.814
40 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
77 4.960683 2015-10-27 18:36:37.959 Decompress 2.844
78 4.960683 2015-10-27 18:36:37.959 DeserializeBond 0.000
108 4.960683 2015-10-27 18:36:37.959 Compress 17.758
109 4.960683 2015-10-27 18:36:37.959 BuildIndex 19.742
110 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.110
111 4.960683 2015-10-27 18:36:37.959 Compress 16.267
112 4.960683 2015-10-27 18:36:37.959 BuildIndex 18.111
113 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.062
I want to group by the (offset, ts, op) fields and sum the time values:
df = df.groupby(['offset', 'ts', 'op']).sum()
So far so good:
time
offset ts op
0.000000 2015-10-27 18:31:40.318 BuildIndex 56.262
Compress 50.122
Decompress 2.953
DeserializeBond 0.015
InsertIndex 0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex 37.853
Compress 34.025
Decompress 2.844
DeserializeBond 0.000
InsertIndex 0.172
The problem is that, within each group, I have to subtract the Compress time from the BuildIndex time. Using DataFrame.xs(), I came up with the following:
diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff['op'] = 'BuildIndex'
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
df.update(diff)
It does work, but I have a strong feeling that there must be a more elegant solution to this. Can anyone suggest a better approach?

Note: your line:
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
is unnecessary, since it leaves diff unchanged (its index is already unique after the earlier groupby).
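The approach above can be sketched end-to-end on a tiny made-up frame (the column names match the question; the numbers and ts values are illustrative, and set_index stands in for the redundant groupby/agg line since all it needs to do is restore the index):

```python
import pandas as pd

# Minimal frame with the question's column layout and made-up numbers.
df = pd.DataFrame({
    'offset': [0.0, 0.0, 0.0, 0.0],
    'ts': ['18:31:40', '18:31:40', '18:31:40', '18:31:40'],
    'op': ['Compress', 'BuildIndex', 'Compress', 'BuildIndex'],
    'time': [17.0, 19.0, 16.0, 18.0],
})
# Sum time per (offset, ts, op): BuildIndex -> 37.0, Compress -> 33.0.
df = df.groupby(['offset', 'ts', 'op']).sum()

# Subtract the Compress total from the BuildIndex total in each group.
diff = df.xs('BuildIndex', level='op') - df.xs('Compress', level='op')
diff['op'] = 'BuildIndex'
diff = diff.reset_index().set_index(['offset', 'ts', 'op'])

# Overwrite the BuildIndex rows in place; Compress rows are untouched.
df.update(diff)
print(df.loc[(0.0, '18:31:40', 'BuildIndex'), 'time'])  # 37.0 - 33.0 = 4.0
```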
One small trick is to use drop_level=False together with .values (so the index is ignored when subtracting). This is a little cheeky, since it assumes each group has exactly one "BuildIndex" and one "Compress" row, which may not be safe:
In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values
In [12]: diff
Out[12]:
time
offset ts op
0.000000 2015-10-27 18:31:40.318 BuildIndex  6.140
4.960683 2015-10-27 18:36:37.959 BuildIndex  3.828
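Here is a self-contained sketch of that trick on a tiny summed frame (the ts values 'a' and 'b' are made up; it assumes exactly one BuildIndex and one Compress row per group):

```python
import pandas as pd

# Tiny stand-in for the already-summed frame df1.
df1 = pd.DataFrame(
    {'time': [19.0, 17.0, 18.0, 16.0]},
    index=pd.MultiIndex.from_tuples(
        [(0.0, 'a', 'BuildIndex'), (0.0, 'a', 'Compress'),
         (1.0, 'b', 'BuildIndex'), (1.0, 'b', 'Compress')],
        names=['offset', 'ts', 'op']),
)

# Keep the BuildIndex index intact (drop_level=False) and subtract the
# Compress column positionally via .values, so no index alignment happens.
diff = (df1.xs('BuildIndex', level='op', drop_level=False)
        - df1.xs('Compress', level='op').values)
print(diff)  # time per group: 19-17 = 2.0 and 18-16 = 2.0
```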
I'd be tempted to unstack here, since the data really is two-dimensional:
In [21]: res = df1.unstack("op")
In [22]: res
Out[22]:
time
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
0.000000 2015-10-27 18:31:40.318     56.262    50.122      2.953            0.015        0.719
4.960683 2015-10-27 18:36:37.959     37.853    34.025      2.844            0.000        0.172
It's not clear this needs to be a MultiIndex column, so flatten it:
In [23]: res.columns = res.columns.get_level_values(1)
In [24]: res
Out[24]:
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
0.000000 2015-10-27 18:31:40.318     56.262    50.122      2.953            0.015        0.719
4.960683 2015-10-27 18:36:37.959     37.853    34.025      2.844            0.000        0.172
Then the subtraction becomes much easier:
In [25]: res["BuildIndex"] - res["Compress"]
Out[25]:
offset ts
0.000000 2015-10-27 18:31:40.318    6.140
4.960683 2015-10-27 18:36:37.959    3.828
dtype: float64
In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]
I guess this is the most elegant one... Awesome! Thanks a lot for the help. It turns out you can address MultiIndex columns as tuples, so after unstacking you can simply write:
df['time','BuildIndex'] -= df['time','Compress']
Now I'm happy :-)
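The unstack-based version can be sketched end-to-end like this (again on a tiny frame with made-up ts values 'a' and 'b'; after unstacking, the columns are ('time', <op>) tuples, which is exactly what the tuple addressing relies on):

```python
import pandas as pd

# Tiny stand-in for the already-summed frame.
df1 = pd.DataFrame(
    {'time': [19.0, 17.0, 18.0, 16.0]},
    index=pd.MultiIndex.from_tuples(
        [(0.0, 'a', 'BuildIndex'), (0.0, 'a', 'Compress'),
         (1.0, 'b', 'BuildIndex'), (1.0, 'b', 'Compress')],
        names=['offset', 'ts', 'op']),
)

# Pivot 'op' into columns: one row per (offset, ts), one column per op.
res = df1.unstack('op')

# Plain column arithmetic, addressing the MultiIndex columns as tuples.
res['time', 'BuildIndex'] -= res['time', 'Compress']
print(res['time', 'BuildIndex'])  # 19-17 = 2.0 and 18-16 = 2.0
```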