Python: select the largest N in a column for each groupby group using pandas

python pandas

My df:

{'city1': {0: 'Chicago',
  1: 'Chicago',
  2: 'Chicago',
  3: 'Chicago',
  4: 'Miami',
  5: 'Houston',
  6: 'Austin'},
 'city2': {0: 'Toronto',
  1: 'Detroit',
  2: 'St.Louis',
  3: 'Miami',
  4: 'Dallas',
  5: 'Dallas',
  6: 'Dallas'},
 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
 'plant1_type': {0: 'COMBCYCL',
  1: 'COMBCYCL',
  2: 'NUKE',
  3: 'COAL',
  4: 'NUKE',
  5: 'COMBCYCL',
  6: 'COAL'},
 'plant2_type': {0: 'COAL',
  1: 'COAL',
  2: 'COMBCYCL',
  3: 'COMBCYCL',
  4: 'COAL',
  5: 'NUKE',
  6: 'NUKE'}}

I want to do 2 groupby operations and take the largest 1 of each group using the column p234_r_c.

1st groupby = ['plant1_type', 'plant2_type', 'city1']

2nd groupby = ['plant1_type', 'plant2_type', 'city2']

So I did the following:

df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].\
    nlargest(1).reset_index()

  plant1_type plant2_type    city1  level_3  p234_r_c
0        COAL    COMBCYCL  Chicago        3       0.5
1        COAL        NUKE   Austin        6       3.0
2    COMBCYCL        COAL  Chicago        0       5.0
3    COMBCYCL        NUKE  Houston        5       4.0
4        NUKE        COAL    Miami        4       1.0
5        NUKE    COMBCYCL  Chicago        2       2.0

df.groupby(['plant1_type','plant2_type','city2'])['p234_r_c'].\
    nlargest(1).reset_index()

   index  p234_r_c
0      0       5.0
1      1       4.0
2      2       2.0
3      3       0.5
4      4       1.0
5      5       4.0
6      6       3.0

The result of the 1st groupby makes sense. However, I am confused by the result of the 2nd groupby.

What happened to the plant1_type, plant2_type and city2 columns in the result? Shouldn't they appear in the result, just as plant1_type, plant2_type and city1 appear in the result of the 1st groupby?

Theory:

When a groupby operation on a pd.Series returns a result with the same values as the original pd.Series, the original index is returned (without the group keys).

Simplified example

import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

# returns results identical to df.A
print(df.groupby(df.A // 2).A.nsmallest(2))

# returns results out of order
print(df.groupby(df.A // 2).A.nlargest(2))

0    0
1    1
2    2
3    3
Name: A, dtype: int64
A
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

I'd argue that you would want these to return the same, consistent index.

This is the most egregious consequence:

# most egregious
# this will be randomly different
print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))

It returns this on one execution:

A
0  1    1
   0    0
1  2    2
   3    3
Name: A, dtype: int64

And this on another:

0    0
1    1
2    2
3    3
Name: A, dtype: int64

Of course, this never has an issue, because it is impossible to return the same values as the original:

print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=1))

A
0  0    0
1  2    2
Name: A, dtype: int64

Workaround

set_index

cols = ['plant1_type','plant2_type','city2']
df.set_index(cols).groupby(level=cols)['p234_r_c'].\
    nlargest(1).reset_index()

  plant1_type plant2_type     city2  p234_r_c
0    COMBCYCL        COAL   Toronto       5.0
1    COMBCYCL        COAL   Detroit       4.0
2        NUKE    COMBCYCL  St.Louis       2.0
3        COAL    COMBCYCL     Miami       0.5
4        NUKE        COAL    Dallas       1.0
5    COMBCYCL        NUKE    Dallas       4.0
6        COAL        NUKE    Dallas       3.0

Most likely you have found a bug.

@codingKnob nice research. In other words, it fails if there is no aggregation - df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].nlargest(2).reset_index() - but works if - df.groupby(['plant1_type','plant2_type','city1'])['p234_r_c'].nlargest(1).reset_index() - there is aggregation.

Guys, is there a way to work around this bug in the meantime? How do I return the index so that I can use it to slice the original dataframe? I want to know which city1 was selected for the ['plant1_type', 'plant2_type', 'city2'] groupby, and which city2 was selected for the ['plant1_type', 'plant2_type', 'city1'] groupby.

@piRSquared - the workaround does not work in all cases. See the workaround presented in
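
On the question raised in the comments (how to get back the index so the original dataframe can be sliced), one possible approach is to skip nlargest and ask each group for its idxmax, then pass those labels to .loc. The sketch below rebuilds the df from the question. top1_rows and top_n_rows are hypothetical helper names, not pandas functions, and the top-N variant swaps in sort_values plus groupby(...).head(n) in place of nlargest. It assumes p234_r_c contains no NaN and that the index labels are unique.

```python
import pandas as pd

# The df from the question, rebuilt column by column.
df = pd.DataFrame({
    'city1': ['Chicago', 'Chicago', 'Chicago', 'Chicago', 'Miami', 'Houston', 'Austin'],
    'city2': ['Toronto', 'Detroit', 'St.Louis', 'Miami', 'Dallas', 'Dallas', 'Dallas'],
    'p234_r_c': [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    'plant1_type': ['COMBCYCL', 'COMBCYCL', 'NUKE', 'COAL', 'NUKE', 'COMBCYCL', 'COAL'],
    'plant2_type': ['COAL', 'COAL', 'COMBCYCL', 'COMBCYCL', 'COAL', 'NUKE', 'NUKE'],
})


def top1_rows(frame, keys, value='p234_r_c'):
    """Hypothetical helper: full rows holding each group's maximum of `value`."""
    # idxmax gives, per group, the original index label of the largest value.
    winners = frame.groupby(keys)[value].idxmax()
    # Slicing with those labels keeps every column, including the other city.
    return frame.loc[winners]


def top_n_rows(frame, keys, n, value='p234_r_c'):
    """Hypothetical helper: up to `n` rows per group with the largest `value`."""
    # Sort once, then take the first n rows of each group; the original
    # index labels are preserved on the returned rows.
    ordered = frame.sort_values(value, ascending=False)
    return ordered.groupby(keys).head(n)


by_city1 = top1_rows(df, ['plant1_type', 'plant2_type', 'city1'])
by_city2 = top1_rows(df, ['plant1_type', 'plant2_type', 'city2'])
top2_city1 = top_n_rows(df, ['plant1_type', 'plant2_type', 'city1'], n=2)

print(by_city1)
print(by_city2)
print(top2_city1)
```

Because whole rows come back, the city2 that goes with each selected city1 group (and the city1 for each city2 group) can be read off directly, and the group key columns are present whether or not the selection happens to match the original rows.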