Elegant groupby and update in pandas?

I have the following pandas.DataFrame object:
offset ts op time
0 0.000000 2015-10-27 18:31:40.318 Decompress 2.953
1 0.000000 2015-10-27 18:31:40.318 DeserializeBond 0.015
32 0.000000 2015-10-27 18:31:40.318 Compress 17.135
33 0.000000 2015-10-27 18:31:40.318 BuildIndex 19.494
34 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.625
35 0.000000 2015-10-27 18:31:40.318 Compress 16.970
36 0.000000 2015-10-27 18:31:40.318 BuildIndex 18.954
37 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
38 0.000000 2015-10-27 18:31:40.318 Compress 16.017
39 0.000000 2015-10-27 18:31:40.318 BuildIndex 17.814
40 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
77 4.960683 2015-10-27 18:36:37.959 Decompress 2.844
78 4.960683 2015-10-27 18:36:37.959 DeserializeBond 0.000
108 4.960683 2015-10-27 18:36:37.959 Compress 17.758
109 4.960683 2015-10-27 18:36:37.959 BuildIndex 19.742
110 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.110
111 4.960683 2015-10-27 18:36:37.959 Compress 16.267
112 4.960683 2015-10-27 18:36:37.959 BuildIndex 18.111
113 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.062
I want to group by the (offset, ts, op) fields and sum the time values:
df = df.groupby(['offset', 'ts', 'op']).sum()
So far so good:
time
offset ts op
0.000000 2015-10-27 18:31:40.318 BuildIndex 56.262
Compress 50.122
Decompress 2.953
DeserializeBond 0.015
InsertIndex 0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex 37.853
Compress 34.025
Decompress 2.844
DeserializeBond 0.000
InsertIndex 0.172
The problem is that, within each group, I have to subtract the Compress time from the BuildIndex time. Using DataFrame.xs(), I came up with the following:
diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff['op'] = 'BuildIndex'
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
df.update(diff)
It does work, but I have a strong feeling that there must be a more elegant solution to this. Can anyone suggest a better approach?

Note: your line:
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
is unnecessary, since it leaves diff unchanged (its index is already unique after the earlier groupby).
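The approach above can be sketched end-to-end on a tiny made-up frame (the column names match the question; the numbers and ts values are illustrative, and set_index stands in for the redundant groupby/agg line since all it needs to do is restore the index):

```python
import pandas as pd

# Minimal frame with the question's column layout and made-up numbers.
df = pd.DataFrame({
    'offset': [0.0, 0.0, 0.0, 0.0],
    'ts': ['18:31:40', '18:31:40', '18:31:40', '18:31:40'],
    'op': ['Compress', 'BuildIndex', 'Compress', 'BuildIndex'],
    'time': [17.0, 19.0, 16.0, 18.0],
})
# Sum time per (offset, ts, op): BuildIndex -> 37.0, Compress -> 33.0.
df = df.groupby(['offset', 'ts', 'op']).sum()

# Subtract the Compress total from the BuildIndex total in each group.
diff = df.xs('BuildIndex', level='op') - df.xs('Compress', level='op')
diff['op'] = 'BuildIndex'
diff = diff.reset_index().set_index(['offset', 'ts', 'op'])

# Overwrite the BuildIndex rows in place; Compress rows are untouched.
df.update(diff)
print(df.loc[(0.0, '18:31:40', 'BuildIndex'), 'time'])  # 37.0 - 33.0 = 4.0
```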
One small trick is to use drop_level=False together with .values (so the index is ignored when subtracting). This is a little cheeky, since it assumes each group has exactly one "BuildIndex" and one "Compress" row, which may not be safe:
In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values
In [12]: diff
Out[12]:
time
offset ts op
0.000000 2015-10-27 18:31:40.318 BuildIndex  6.140
4.960683 2015-10-27 18:36:37.959 BuildIndex  3.828
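Here is a self-contained sketch of that trick on a tiny summed frame (the ts values 'a' and 'b' are made up; it assumes exactly one BuildIndex and one Compress row per group):

```python
import pandas as pd

# Tiny stand-in for the already-summed frame df1.
df1 = pd.DataFrame(
    {'time': [19.0, 17.0, 18.0, 16.0]},
    index=pd.MultiIndex.from_tuples(
        [(0.0, 'a', 'BuildIndex'), (0.0, 'a', 'Compress'),
         (1.0, 'b', 'BuildIndex'), (1.0, 'b', 'Compress')],
        names=['offset', 'ts', 'op']),
)

# Keep the BuildIndex index intact (drop_level=False) and subtract the
# Compress column positionally via .values, so no index alignment happens.
diff = (df1.xs('BuildIndex', level='op', drop_level=False)
        - df1.xs('Compress', level='op').values)
print(diff)  # time per group: 19-17 = 2.0 and 18-16 = 2.0
```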
I'd be tempted to unstack here, since the data really is two-dimensional:
In [21]: res = df1.unstack("op")
In [22]: res
Out[22]:
time
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
0.000000 2015-10-27 18:31:40.318     56.262    50.122      2.953            0.015        0.719
4.960683 2015-10-27 18:36:37.959     37.853    34.025      2.844            0.000        0.172
It's not clear this needs to be a MultiIndex column, so flatten it:
In [23]: res.columns = res.columns.get_level_values(1)
In [24]: res
Out[24]:
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
0.000000 2015-10-27 18:31:40.318     56.262    50.122      2.953            0.015        0.719
4.960683 2015-10-27 18:36:37.959     37.853    34.025      2.844            0.000        0.172
Then the subtraction becomes much easier:
In [25]: res["BuildIndex"] - res["Compress"]
Out[25]:
offset ts
0.000000 2015-10-27 18:31:40.318    6.140
4.960683 2015-10-27 18:36:37.959    3.828
dtype: float64
In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]
I guess this is the most elegant one... Awesome! Thanks a lot for the help. It turns out you can address MultiIndex columns as tuples, so after unstacking you can simply write:
df['time','BuildIndex'] -= df['time','Compress']
Now I'm happy :-)
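The unstack-based version can be sketched end-to-end like this (again on a tiny frame with made-up ts values 'a' and 'b'; after unstacking, the columns are ('time', <op>) tuples, which is exactly what the tuple addressing relies on):

```python
import pandas as pd

# Tiny stand-in for the already-summed frame.
df1 = pd.DataFrame(
    {'time': [19.0, 17.0, 18.0, 16.0]},
    index=pd.MultiIndex.from_tuples(
        [(0.0, 'a', 'BuildIndex'), (0.0, 'a', 'Compress'),
         (1.0, 'b', 'BuildIndex'), (1.0, 'b', 'Compress')],
        names=['offset', 'ts', 'op']),
)

# Pivot 'op' into columns: one row per (offset, ts), one column per op.
res = df1.unstack('op')

# Plain column arithmetic, addressing the MultiIndex columns as tuples.
res['time', 'BuildIndex'] -= res['time', 'Compress']
print(res['time', 'BuildIndex'])  # 19-17 = 2.0 and 18-16 = 2.0
```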