Python: How to identify rows and columns from the top-K values in a DataFrame

python numpy pandas

I have a DataFrame created like this:

import pandas as pd
d = {'gene' : ['foo', 'qux', 'bar', 'bin'],
     'one' : [1., 2., 3., 1.],
     'two' : [4., 3., 2., 1.],
     'three' : [1., 2., 20., 1.],
     }

df = pd.DataFrame(d)

# List the top 5 values
ndf = df[['one', 'two', 'three']]
top = ndf.values.flatten().tolist()
top.sort(reverse=True)
top[0:5]
# [20.0, 4.0, 3.0, 3.0, 2.0]

It looks like this:

In [58]: df
Out[58]:
  gene  one  three  two
0  foo    1      1    4
1  qux    2      2    3
2  bar    3     20    2
3  bin    1      1    1

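What I want to do is collapse all the values from the second column onward, take the top 5 scores, and identify the row and column each selected value comes from.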

The desired dictionary would then look like this:

{'foo':['two'],
'qux':['one','two','three'],
'bar':['one','two','three']}

How can I do this?

Here is a solution that works but is not clean:

top5=top[0:5]
dt=df.set_index('gene').T
d={}
for col in dt.columns:
    idx_list=dt[col][dt[col].isin(top5)].index.tolist()
    if idx_list:
        d[col]=idx_list 
d

This returns:

{'bar': ['one', 'three', 'two'],
 'foo': ['two'],
 'qux': ['one', 'three', 'two']}

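Before starting, I would set the gene column as the index. This makes it easier to isolate the numeric columns (as you did with ndf) and easier to return a dictionary later: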

df.set_index('gene', inplace=True)

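I would then proceed in two steps.

First, use numpy to get the fifth-largest value (called n_max in the second step).

Using partition avoids sorting the whole array (as was done for top), which can be expensive when the array is large.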
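The first step can be sketched as follows. This is a minimal sketch, assuming n_max denotes the fifth-largest value among all numeric cells and that K is 5 as in the question; the variable names k and values are illustrative, and the exact expression in the original answer may differ.

```python
import numpy as np
import pandas as pd

d = {'gene': ['foo', 'qux', 'bar', 'bin'],
     'one': [1., 2., 3., 1.],
     'two': [4., 3., 2., 1.],
     'three': [1., 2., 20., 1.]}
df = pd.DataFrame(d).set_index('gene')

k = 5  # assumption: K = 5, as in the question
values = df.to_numpy().ravel()  # all numeric cells as a flat array

# np.partition puts the element that would sit at position `pos` in a
# fully sorted array at that position, without sorting everything else.
pos = values.size - k
n_max = np.partition(values, pos)[pos]
print(n_max)  # 2.0
```

Any cell whose value is at least n_max belongs to the top 5, ties included.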

Second, apply a lambda function to retrieve the column names:
df.apply(lambda row: row.index[row >= n_max].tolist(), axis=1).to_dict()

Note that because each row is a Series, the row's index is the DataFrame's columns. Result:
{'bar': ['one', 'three', 'two'],
 'bin': [],
 'foo': ['two'],
 'qux': ['one', 'three', 'two']}
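The result above contains an empty list for bin, which the desired dictionary in the question leaves out. A minimal sketch that drops such entries, assuming n_max is 2.0 (the fifth-largest value for this data) and using a dictionary comprehension in place of apply:

```python
import pandas as pd

d = {'gene': ['foo', 'qux', 'bar', 'bin'],
     'one': [1., 2., 3., 1.],
     'two': [4., 3., 2., 1.],
     'three': [1., 2., 20., 1.]}
df = pd.DataFrame(d).set_index('gene')

n_max = 2.0  # assumption: fifth-largest value for this data

# For each gene, collect the columns whose value reaches the threshold.
result = {gene: row.index[row >= n_max].tolist()
          for gene, row in df.iterrows()}

# Keep only genes that have at least one qualifying column.
result = {gene: cols for gene, cols in result.items() if cols}
print(result)
```

The order of column names inside each list follows the DataFrame's column order, so it may differ from the alphabetical order shown above.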

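You can stack the DataFrame, then take the 5 largest values (I use rank because you seem to want to include all ties), then group by gene to get the dictionary: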
In [2]: df_stack = df.set_index('gene').stack()

In [3]: df_top = df_stack.loc[df_stack.rank(method='min', ascending=False) <= 5]

In [4]: print(df_top.reset_index(0).groupby('gene').groups)
{'qux': ['one', 'three', 'two'], 'foo': ['two'], 'bar': ['one', 'three', 'two']}
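The same stack-and-rank idea can be written as a self-contained script that builds a plain dictionary of lists. This is a sketch under the assumption that K is 5 and that ties should all be kept; it walks the (gene, column) index pairs directly instead of using groupby, and the names stacked, top_cells, and result are illustrative.

```python
import pandas as pd

d = {'gene': ['foo', 'qux', 'bar', 'bin'],
     'one': [1., 2., 3., 1.],
     'two': [4., 3., 2., 1.],
     'three': [1., 2., 20., 1.]}
df = pd.DataFrame(d)

# One value per (gene, column) pair.
stacked = df.set_index('gene').stack()

# method='min' gives tied values the same (lowest) rank, so every
# value tied with the fifth-largest one is kept.
top_cells = stacked[stacked.rank(method='min', ascending=False) <= 5]

# Group the surviving column names by gene.
result = {}
for gene, col in top_cells.index:
    result.setdefault(gene, []).append(col)
print(result)
```

Because the three cells equal to 2.0 all share rank 5, seven cells are selected here rather than exactly five.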
