Python dataframe: return the row and column of the maximum value

python pandas

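I have a dataframe in which all values are of the same kind (for example a correlation matrix, but one where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.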

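I can get the maximum along the rows or along the columns by changing the argument of

df.idxmax()

However, I haven't found a suitable way to return the row/column index of the maximum of the whole dataframe.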
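
To make the per-axis behaviour concrete, here is a minimal sketch using the same example frame that appears later in the question; the variable names per_column and per_row are only illustrative. It shows that idxmax answers the question one column or one row at a time, not for the frame as a whole.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list("abc"), index=list("def"))

# axis=0 (the default): for each column, the row label of its maximum
per_column = df.idxmax(axis=0)

# axis=1: for each row, the column label of its maximum
per_row = df.idxmax(axis=1)

print(per_column.to_dict())
print(per_row.to_dict())
```

Neither result alone says where the overall maximum sits, which is the gap the question is about.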

For example, I can do this in numpy:

>>> import numpy as np
>>> npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>> np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas:

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN

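At a second level, what I really want is to return the rows and columns of the top n values, for example as a Series.
For the frame above, I'd like a function that does:

>>> topn(df,3)
b    e
c    f
b    f
dtype: object
>>> type(topn(df,3))
pandas.core.series.Series

or even just

>>> topn(df,3)
(['b','c','b'],['e','f','f'])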

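I guess a DataFrame might not be the best choice for what you are trying to do, since the idea behind the columns of a DataFrame is that they hold independent data.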

>>> def topn(df, n):
...     # pull the data out of the DataFrame and flatten it
...     # column by column into a 1-D array
...     vals = df.values.flatten(order='F')
...     # argsort gives the permutation that sorts vals in ascending order
...     p = np.argsort(vals)
...     # build arrays of column names and index labels
...     # in the same (column-major) order as vals
...     cols = np.array([[col] * len(df.index) for col in df.columns]).flatten()
...     idxs = np.array([list(df.index) for _ in df.columns]).flatten()
...     # take the last n entries of the sorted order, largest first
...     return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]
...
>>> topn(df,3)
(array(['b', 'c', 'b'], dtype='|S1'), array(['e', 'f', 'f'], dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonic's solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but both are much faster than stack:

def topStack(df, n):
    df = df.stack()
    # sort all stacked values in descending order
    df = df.sort_values(ascending=False)
    return df.head(n)

%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
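
As a side note on the stack-based variant, a shorter sketch is possible if one lets pandas select the n largest stacked values directly with Series.nlargest instead of sorting everything. The function name top_stack is only illustrative, and no timing claim is made for it here.

```python
import pandas as pd

def top_stack(df, n):
    """Return the n largest cells as a Series indexed by (row, column) labels."""
    # stack() turns the frame into a Series with a (row, column) MultiIndex;
    # nlargest(n) keeps the n largest values, largest first
    return df.stack().nlargest(n)

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list("abc"), index=list("def"))
top = top_stack(df, 3)
print(list(top.index))
print(top.tolist())
```

The index of the result carries the row and column labels together, so they can be split apart afterwards if two separate lists are preferred.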

I figured out the first part:

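npa = df.to_numpy()
# np.where returns the matches as (row positions, column positions)
indx, cols = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols], [df.index[i] for i in indx])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top value with NaN, grabbing its index each time. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately there is, via argpartition, but we have to use flattened indexing.

def topn(df, n):
    npa = df.to_numpy()
    # flat (row-major) indices of the n largest values, unsorted
    topn_ind = np.argpartition(npa, -n, None)[-n:]
    # sort those indices by value, in descending order
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]
    # unflatten in the same row-major order, giving (row, column) positions
    indx, cols = np.unravel_index(topn_ind, npa.shape)
    return ([df.columns[c] for c in cols], [df.index[i] for i in indx])

Trying this on the example:

>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])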

As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.
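
For comparison, the naive idea mentioned earlier in this answer (repeatedly masking the current maximum with NaN) could be sketched roughly as below. The name topn_naive is hypothetical, the function works on positions only, and it assumes n is no larger than the number of non-NaN cells, since np.nanargmax raises on an all-NaN array.

```python
import numpy as np

def topn_naive(arr, n):
    """Return (rows, cols) positions of the n largest entries, largest first."""
    work = np.array(arr, dtype=float)  # float copy so entries can be set to NaN
    rows, cols = [], []
    for _ in range(n):
        flat = np.nanargmax(work)  # flat position of the current maximum, ignoring NaN
        r, c = np.unravel_index(flat, work.shape)
        rows.append(int(r))
        cols.append(int(c))
        work[r, c] = np.nan  # mask it so the next pass finds the runner-up
    return rows, cols

arr = np.array([[1, 2, 3], [4, 9, 5], [6, 7, 8]])
result = topn_naive(arr, 3)
print(result)
```

Each pass scans the whole array, so the cost grows with n times the array size, which is the inefficiency that argpartition avoids.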
What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)
df.head(4)
e  b    9
f  c    8
   b    7
   a    6
dtype: int64
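
If only the single maximum is needed, the same stacking idea can be sketched without any sorting: idxmax on the stacked Series returns the (row, column) label pair directly. The variable names row and col are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list("abc"), index=list("def"))

# The stacked Series is indexed by (row label, column label),
# so idxmax() yields the location of the overall maximum as a tuple.
row, col = df.stack().idxmax()
print(row, col)
```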

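Sorry, this is wrong in a couple of ways and does not solve my problem. I added some explicit examples to make it clearer. Also, df.max() is a lot simpler than df.describe().loc['max'].

Hey @greole, cool to see a completely different way of doing it! For the benefit of the community, could you explain your code? Some lines are a bit obscure, e.g. [[col]*len(df) for col in df.columns] and MultiIndex.from_tuples(zip(cols, val.index)). Could you take it apart piece by piece, or add comments? Also, if you package it as a function, we can compare runtimes... Thanks for the different approach.

@watsonic ok, I refactored it a bit and added some comments. Avoiding operations on the DataFrame itself gives it a huge speed advantage.

Should topn(df,3) return (['b','c','a'],['e','f','f']) or (['b','c','b'],['e','f','f'])?

The latter: df['a']['f'] = 6 whereas df['b']['f'] = 7.

This doesn't seem to work for arbitrary dataframes; topn(pd.DataFrame(np.random.randn(40,5)),4) gives me an IndexError: index out of bounds error.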
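
One plausible explanation for an error like this on a non-square frame, offered only as a guess about the cause: a flat index taken from a row-major flattening is converted back with column-major ('F') ordering, and the first returned coordinate is then used as a column position. The small sketch below uses only the shape from the comment; the cell (1, 2) is an arbitrary example.

```python
import numpy as np

shape = (40, 5)  # 40 rows, 5 columns, as in the comment above

# row-major flat position of the cell at row 1, column 2
flat = np.ravel_multi_index((1, 2), shape)

# unraveling with the same row-major order recovers the original cell
row_c, col_c = np.unravel_index(flat, shape)

# unraveling with column-major order points at a different cell
row_f, col_f = np.unravel_index(flat, shape, order="F")

print(int(flat), (int(row_c), int(col_c)), (int(row_f), int(col_f)))
```

On a square frame the mismatch merely swaps row and column positions, which is why it can go unnoticed there; with 40 rows and 5 columns, a value such as 7 is not a valid column position.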