Python: How to get the rows with the maximum value in one column, grouped by another column, from a 2D NumPy array?

python numpy

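This is a very common SQL query: select the rows with the maximum value in column X, grouped by group_id. The result is, for each group_id, one (the first) row in which the value of column X is the largest within the group.

I have a 2D NumPy array with many columns, but let's simplify it to (ID, X, Y):

I want to get: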

[[1 22 1236]
 [2 23 1111]]

I can do it with a tedious loop, such as:

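  # Assumes rows with the same ID are adjacent (e.g. sorted by ID)
  row_grouped_with_max = []

  max_row = rows[0]
  last_max = max_row[1]
  last_row_group = max_row[0]
  for row in rows:
      if row[0] != last_row_group:
          # New group: store the previous group's best row and start over
          row_grouped_with_max.append(max_row)
          last_row_group = row[0]
          max_row = row
          last_max = row[1]
      elif last_max < row[1]:
          # Strictly greater, so the first row wins on ties
          max_row = row
          last_max = row[1]
  # Store the best row of the final group
  row_grouped_with_max.append(max_row)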


How can I do this in a clean NumPy way?

Assuming you have n columns:

Use a.max along the first axis and unpack the values:

x1max, x2max, ..., xnmax = a.max(axis=0)
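As a small illustration of what that call returns, here is a sketch on a made-up four-row array (the array and the variable names are assumptions, not taken from the question). Note that `a.max(axis=0)` reduces over all rows, so on its own it gives the column maxima for the whole array rather than one row per group.

```python
import numpy as np

# Hypothetical sample data with columns (ID, X, Y)
a = np.array([[1, 22, 1236],
              [1, 11, 1563],
              [2, 13, 1234],
              [2, 23, 1111]])

# Column-wise maxima over the whole array: one value per column
id_max, x_max, y_max = a.max(axis=0)
print(id_max, x_max, y_max)  # 2 23 1563

# The full row holding the overall maximum of column X
# (argmax returns the first occurrence on ties)
best_row = a[a[:, 1].argmax()]
print(best_row.tolist())  # [2, 23, 1111]
```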

An alternative using the pandas library (IMO, it is easier to manipulate ndarrays there):

It may not be very clean, but here is a vectorized approach to solve it:

# Get sorted "rows"
sorted_rows = rows[np.argsort(rows[:,0])]

# Get count of elements for each ID
_,count = np.unique(sorted_rows[:,0],return_counts=True)

# Form mask to fill elements from X-column
N1 = count.max()
N2 = len(count)
mask = np.arange(N1) < count[:,None]

# Form a 2D array of X values, one row per unique ID, padded with -inf
ID_2Darray = np.empty((N2,N1))
ID_2Darray.fill(-np.inf)
ID_2Darray[mask] = sorted_rows[:,1]

# Get ID based max indices
grp_max_idx = np.argmax(ID_2Darray,axis=1) + np.append([0],count.cumsum()[:-1])

# Finally, get the "maxed"-X rows
out = sorted_rows[grp_max_idx]

This can be solved elegantly and in a fully vectorized way using the numpy_indexed package (disclaimer: I am its author):

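You want [2 23 1111] rather than [2 23 1250]? Is Y meaningless in your max?

@Scott Yes, Y is irrelevant in this case. I have many more columns, and there is no point in trying to sort by them. But I need to keep all columns of the one selected record that holds the maximum within its group. If several rows share the same maximum, any one of them is fine.

It seems all roads lead to pandas now. Thanks, I will reconsider installing pandas.

For 2D data, I would choose pandas. For heavily multidimensional data, I would stick with numpy, since that is where it shines. For quick munging like this, however, pandas has the least overhead, at least for me.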

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: rows = np.array([[1,22,1236],
   ...:                  [1,11,1563],
   ...:                  [2,13,1234],
   ...:                  [2,10,1224],
   ...:                  [2,23,1111],
   ...:                  [2,23,1250]])
   ...: print(rows)
[[   1   22 1236]
 [   1   11 1563]
 [   2   13 1234]
 [   2   10 1224]
 [   2   23 1111]
 [   2   23 1250]]

In [3]: df = pd.DataFrame(rows)
   ...: print(df)
   0   1     2
0  1  22  1236
1  1  11  1563
2  2  13  1234
3  2  10  1224
4  2  23  1111
5  2  23  1250

In [4]: g = df.groupby(0)[1].transform('max')
   ...: print(g)
0    22
1    22
2    23
3    23
4    23
5    23
dtype: int32

In [5]: df2 = df[df[1] == g]
   ...: print(df2)
   0   1     2
0  1  22  1236
4  2  23  1111
5  2  23  1250

In [6]: df3 = df2.drop_duplicates([0])  # keep one row per group ID
   ...: print(df3)
   0   1     2
0  1  22  1236
4  2  23  1111

In [7]: mtx = df3.to_numpy()
   ...: print(mtx)
[[   1   22 1236]
 [   2   23 1111]]
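The steps in the transcript above can probably be condensed with `groupby(...).idxmax()`, which returns the index label of the first maximum in each group. The following is only a sketch: the column names `ID`, `X`, `Y` are assumed here for readability and do not appear in the original session.

```python
import numpy as np
import pandas as pd

rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

# Column names are an assumption made for this example
df = pd.DataFrame(rows, columns=["ID", "X", "Y"])

# Index label of the first row with the largest X within each ID group
idx = df.groupby("ID")["X"].idxmax()

# Select those rows and convert back to a plain ndarray
result = df.loc[idx].to_numpy()
print(result.tolist())  # [[1, 22, 1236], [2, 23, 1111]]
```

Because `idxmax` picks the first occurrence, ties such as the two rows with X = 23 resolve to the earlier row, which matches the output requested in the question.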


Sample run of the vectorized approach above:

In [101]: rows
Out[101]: 
array([[   2,   13, 1234],
       [   1,   22, 1236],
       [   2,   23, 1250],
       [   6,   12, 1345],
       [   4,   10,  290],
       [   2,   10, 1224],
       [   2,   23, 1111],
       [   4,   45,   99],
       [   1,   11, 1563],
       [   4,   23,   89]])

In [102]: out
Out[102]: 
array([[   1,   22, 1236],
       [   2,   23, 1250],
       [   4,   45,   99],
       [   6,   12, 1345]])
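A more compact pure-NumPy variant is sketched below. It is not the padded-mask method shown earlier but a different technique: `np.lexsort` orders the rows by ID and then by descending X, so the first row of each ID block is the one wanted. The negation trick assumes X is a signed numeric column.

```python
import numpy as np

rows = np.array([[2, 13, 1234],
                 [1, 22, 1236],
                 [2, 23, 1250],
                 [6, 12, 1345],
                 [4, 10, 290],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [4, 45, 99],
                 [1, 11, 1563],
                 [4, 23, 89]])

# lexsort treats the LAST key as the primary key: sort by ID ascending,
# then by X descending (via negation) within each ID. The sort is stable,
# so rows tied on (ID, X) keep their original relative order.
order = np.lexsort((-rows[:, 1], rows[:, 0]))
sorted_rows = rows[order]

# The first row of each ID block now holds that group's largest X
_, first_idx = np.unique(sorted_rows[:, 0], return_index=True)
out = sorted_rows[first_idx]
print(out.tolist())
# [[1, 22, 1236], [2, 23, 1250], [4, 45, 99], [6, 12, 1345]]
```

This avoids building the padded 2D array, at the cost of a full sort on two keys.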

import numpy as np
import numpy_indexed as npi

# sort rows by the 2nd column (X)
rows = rows[np.argsort(rows[:, 1])]
# group_by is stable, so the last item in each group is the one we are after
print(npi.group_by(rows[:, 0]).last(rows))