Python: get the maximum across multiple columns and return the corresponding values

python python-3.x pandas dataframe

I have a DataFrame like this:

Sequence    Duration1   Value1  Duration2   Value2  Duration3   Value3
1001        145         10      125         53      458         33
1002        475         20      175         54      652         45
1003        685         57      687         87      254         88
1004        125         54      175         96      786         96
1005        475         21      467         32      526         32
1006        325         68      301         54      529         41
1007        125         97      325         85      872         78
1008        129         15      429         41      981         82
1009        547         47      577         52      543         83
1010        666         65      722         63      257         87

I want to find the maximum Duration among (Duration1, Duration2, Duration3) and return the corresponding Value and Sequence.

My expected output:

Sequence,Duration3,Value3
1008,    981,      82

You can get the index of a column's maximum with:

>>> idx = df['Duration3'].idxmax()
>>> idx
7

and select only the relevant columns with:

>>> df_cols = df[['Sequence', 'Duration3', 'Value3']]
>>> df_cols.loc[idx]
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

So just wrap all of that in a nice function:

def get_max(df, i):
    idx = df[f'Duration{i}'].idxmax()
    df_cols = df[['Sequence', f'Duration{i}', f'Value{i}']]
    return df_cols.loc[idx]

and loop over 1..3:
>>> max_rows = [get_max(df, i) for i in range(1, 4)]
>>> print('\n\n'.join(map(str, max_rows)))
Sequence     1003
Duration1     685
Value1         57
Name: 2, dtype: int64

Sequence     1010
Duration2     722
Value2         63
Name: 9, dtype: int64

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

If you want to reduce these 3 rows to the single maximum row, you can do the following:
>>> pairs = enumerate(max_rows, 1)
>>> by_duration = lambda x: x[1][f'Duration{x[0]}']
>>> i, max_row = max(pairs, key=by_duration)
>>> max_row
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64
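
For anyone who wants to run the steps above end to end, here is a self-contained sketch. It rebuilds the question's data by hand (the literal values are copied from the table in the question) and passes df explicitly to get_max; the variable name best_i is my own.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'Sequence':  [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'Duration1': [145, 475, 685, 125, 475, 325, 125, 129, 547, 666],
    'Value1':    [10, 20, 57, 54, 21, 68, 97, 15, 47, 65],
    'Duration2': [125, 175, 687, 175, 467, 301, 325, 429, 577, 722],
    'Value2':    [53, 54, 87, 96, 32, 54, 85, 41, 52, 63],
    'Duration3': [458, 652, 254, 786, 526, 529, 872, 981, 543, 257],
    'Value3':    [33, 45, 88, 96, 32, 41, 78, 82, 83, 87],
})

def get_max(df, i):
    """Return Sequence/Duration{i}/Value{i} for the row where Duration{i} peaks."""
    idx = df[f'Duration{i}'].idxmax()
    return df.loc[idx, ['Sequence', f'Duration{i}', f'Value{i}']]

# One candidate row per Duration column.
max_rows = [get_max(df, i) for i in range(1, 4)]

# Keep the candidate whose own Duration{i} entry is largest.
best_i, max_row = max(enumerate(max_rows, 1),
                      key=lambda pair: pair[1][f'Duration{pair[0]}'])
print(max_row)
```

The loop computes one idxmax per Duration column, so it stays cheap for a handful of columns.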

If I understand the problem correctly, consider the following DataFrame:
df = pd.DataFrame(data={'Seq': [1, 2, 3], 'Dur1': [2, 7, 3],'Val1': ['x', 'y', 'z'],'Dur2': [3, 5, 1], 'Val2': ['a', 'b', 'c']})
    Seq  Dur1 Val1  Dur2 Val2
0    1     2    x     3    a
1    2     7    y     5    b
2    3     3    z     1    c

These 5 lines of code solve your problem:
dur_col = [col_name for col_name in df.columns if col_name.startswith('Dur')] # ['Dur1', 'Dur2'] 
max_dur_name = df.loc[:, dur_col].max().idxmax()
val_name = "Val" + str([int(s) for s in max_dur_name if s.isdigit()][0])

filter_col = ['Seq', max_dur_name, val_name]

df_res = df[filter_col].sort_values(max_dur_name, ascending=False).head(1)

And you get:
   Seq  Dur1 Val1 
1    2     7    y  

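Code explanation:
I automatically get the columns that start with 'Dur' and find the name of the column that holds the largest duration: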
dur_col = [col_name for col_name in df.columns if col_name.startswith('Dur')] # ['Dur1', 'Dur2'] 
max_dur_name = df.loc[:, dur_col].max().idxmax()
val_name = "Val" + str([int(s) for s in max_dur_name if s.isdigit()][0])

Select the columns I am interested in:
filter_col = ['Seq', max_dur_name, val_name]

Filter to the columns I am interested in, sort by max_dur_name, and take the top result:
df_res = df[filter_col].sort_values(max_dur_name, ascending=False).head(1)

# output:
   Seq  Dur1 Val1 
1    2     7    y   

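Try the following very short code, based mainly on Numpy:
The result is a Series:
If you want to "reshape" it (index values first, then the actual values),
you can get a result like this: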
pd.DataFrame([result.values], columns=result.index)
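
The code for this answer did not survive in the thread, so the following is only my own sketch of what a NumPy-based approach could look like, not the original author's code. It assumes the Value column name can be derived by replacing 'Duration' with 'Value'; the names arr, row, col, dur_name, val_name and wide are mine.

```python
import numpy as np
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'Sequence':  [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'Duration1': [145, 475, 685, 125, 475, 325, 125, 129, 547, 666],
    'Value1':    [10, 20, 57, 54, 21, 68, 97, 15, 47, 65],
    'Duration2': [125, 175, 687, 175, 467, 301, 325, 429, 577, 722],
    'Value2':    [53, 54, 87, 96, 32, 54, 85, 41, 52, 63],
    'Duration3': [458, 652, 254, 786, 526, 529, 872, 981, 543, 257],
    'Value3':    [33, 45, 88, 96, 32, 41, 78, 82, 83, 87],
})

dur_cols = [c for c in df.columns if c.startswith('Duration')]
arr = df[dur_cols].to_numpy()

# argmax() works on the flattened array; unravel_index maps it back to (row, col).
row, col = np.unravel_index(arr.argmax(), arr.shape)

dur_name = dur_cols[col]
val_name = dur_name.replace('Duration', 'Value')  # assumed naming convention

# result is a Series; wide is the one-row DataFrame form of it.
result = df.loc[df.index[row], ['Sequence', dur_name, val_name]]
wide = pd.DataFrame([result.values], columns=result.index)
print(wide)
```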

Here is another way:
m=df.set_index('Sequence') #set Sequence as index
n=m.filter(like='Duration') #gets all columns with the name Duration
s=n.idxmax()[n.eq(n.values.max()).any()]
#output Duration3    1008
d = dict(zip(m.columns[::2],m.columns[1::2])) #create a mapper dict
#{'Duration1': 'Value1', 'Duration2': 'Value2', 'Duration3': 'Value3'}
final=m.loc[s.values,s.index.union(s.index.map(d))].reset_index()


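Without numpy wizardry:

First of all, there are some very good solutions to this problem, proposed by others.
The data is the data provided in the question, as df.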

# find the max value in the Duration columns
max_value = max(df.filter(like='Dur', axis=1).max().tolist())
# get a Boolean match of the dataframe for the max value
df_max = df[df == max_value]
# get the row index
max_index = df_max.dropna(how='all').index[0]
# get the column name
max_col = df_max.dropna(axis=1, how='all').columns[0]
# get the column index
max_col_index = df.columns.get_loc(max_col)
# final
df.iloc[max_index, [0, max_col_index, max_col_index + 1]]

Output:
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

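Update

Last night, actually at 4 a.m., I dismissed a better solution because I was too tired.

I used max_value = max(df.filter(like='Dur', axis=1).max().tolist()), which returns the maximum value within the Duration columns,
instead of max_col_name = df.filter(like='Dur', axis=1).max().idxmax(), which returns the name of the column where the maximum value occurs.
I did that because my brain was telling me that idxmax would return the max of the column names, rather than the name of the column holding the max value. For example: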


test = ['Duration5', 'Duration2', 'Duration3']
print(max(test))
>>> 'Duration5'


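This is why being overtired is a poor condition for problem solving.
With sleep and coffee, here is a more efficient solution.

It is similar to the others in its use of idxmax.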


New and improved solution:
# column name with the max duration value
max_col_name = df.filter(like='Dur', axis=1).max().idxmax()
# index of max_col_name
max_col_idx = df.columns.get_loc(max_col_name)
# row index of the max value in max_col_name
max_row_idx = df[max_col_name].idxmax()
# output with .iloc
df.iloc[max_row_idx, [0, max_col_idx, max_col_idx + 1]]

Output:
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64
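
One caveat worth sketching: idxmax returns an index label, while iloc expects an integer position. The two coincide for the default RangeIndex used in the question, but not in general. Below is a small illustration on the first three rows of the question's data with a made-up string index ('a', 'b', 'c' are purely hypothetical); it converts the label with df.index.get_loc before calling iloc.

```python
import pandas as pd

# First three rows of the question's data (two Duration/Value pairs only),
# with a hypothetical non-default index to show the label/position difference.
df = pd.DataFrame(
    {
        'Sequence':  [1001, 1002, 1003],
        'Duration1': [145, 475, 685],
        'Value1':    [10, 20, 57],
        'Duration2': [125, 175, 687],
        'Value2':    [53, 54, 87],
    },
    index=['a', 'b', 'c'],
)

max_col_name = df.filter(like='Dur', axis=1).max().idxmax()
max_col_idx = df.columns.get_loc(max_col_name)

# idxmax gives a label ('c' here); translate it to an integer position for iloc.
max_row_label = df[max_col_name].idxmax()
max_row_pos = df.index.get_loc(max_row_label)

row = df.iloc[max_row_pos, [0, max_col_idx, max_col_idx + 1]]
print(row)
```

Like the code above it, this still assumes each Value column sits immediately to the right of its Duration column.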

Somewhat similar to another answer here, but I think it is different enough to be worth adding:
mvc = df[[name for name in df.columns if 'Duration' in name]].max().idxmax()
mvidx = df[mvc].idxmax()
valuecol = 'Value' + mvc[-1]
df.loc[mvidx, ['Sequence', mvc, valuecol]]

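First, I get the name of the column that holds the maximum value, mvc (mvc is 'Duration3' in this example).
Then I get the row index of the maximum value, mvidx (mvidx is 7).
Then I build the matching value column name (valuecol is 'Value3').
Finally, using loc, I select the desired output, which is: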
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64
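
Note that mvc[-1] takes only the last character, so it would pick the wrong Value column once the suffix has two digits (for example 'Duration10'). A hedged variant that extracts the whole trailing number with a regular expression is sketched below; the small DataFrame is made-up data for illustration only, and suffix is my own name.

```python
import re
import pandas as pd

# Hypothetical data with a two-digit column suffix.
df = pd.DataFrame({
    'Sequence':   [1, 2],
    'Duration9':  [5, 7],
    'Value9':     [50, 70],
    'Duration10': [9, 3],
    'Value10':    [90, 30],
})

mvc = df[[name for name in df.columns if name.startswith('Duration')]].max().idxmax()
mvidx = df[mvc].idxmax()

# Grab all trailing digits instead of just the last character.
suffix = re.search(r'\d+$', mvc).group()
valuecol = 'Value' + suffix

row = df.loc[mvidx, ['Sequence', mvc, valuecol]]
print(row)
```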


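For wide data it can be easier to first reshape with wide_to_long. This creates two columns, ['Duration', 'Value'], and the MultiIndex tells us which number each row came from. There is no reliance on any specific column ordering.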
import pandas as pd

df = pd.wide_to_long(df, i='Sequence', j='num', stubnames=['Duration', 'Value'])
df.loc[[df.Duration.idxmax()]]

              Duration  Value
Sequence num                 
1008     3         981     82
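
If the flat 'Sequence, Duration3, Value3' layout from the question is needed, the long result can be turned back into that shape. This is only a sketch under the assumption that rebuilding the column names from the num level is acceptable; long_df, seq, num, best and out are names I made up.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'Sequence':  [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'Duration1': [145, 475, 685, 125, 475, 325, 125, 129, 547, 666],
    'Value1':    [10, 20, 57, 54, 21, 68, 97, 15, 47, 65],
    'Duration2': [125, 175, 687, 175, 467, 301, 325, 429, 577, 722],
    'Value2':    [53, 54, 87, 96, 32, 54, 85, 41, 52, 63],
    'Duration3': [458, 652, 254, 786, 526, 529, 872, 981, 543, 257],
    'Value3':    [33, 45, 88, 96, 32, 41, 78, 82, 83, 87],
})

long_df = pd.wide_to_long(df, i='Sequence', j='num',
                          stubnames=['Duration', 'Value'])

# With a MultiIndex, idxmax returns the (Sequence, num) tuple of the max row.
seq, num = long_df['Duration'].idxmax()
best = long_df.loc[(seq, num)]

# Rebuild the wide-style column names from the num level.
out = pd.DataFrame({
    'Sequence': [seq],
    f'Duration{num}': [best['Duration']],
    f'Value{num}': [best['Value']],
})
print(out)
```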

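Sir, is it possible to filter only the max Duration, so the result is "Sequence, Duration3, Value3" "1008, 981, 82"?
Sir, my requirement is: if Dur1 has the max value, the output should contain only 'Seq', 'Dur1', 'Val1'. If Dur2 has the max value, the output should be 'Seq', 'Dur2', 'Val2'.
Although this relies heavily on the ordering of the columns (to be safe, I suppose a .reindex at the start could ensure that).
Sir, I really like your answer. I have a similar problem and need your help.
It should be noted that this does a lot of redundant computation on the dataframe.
@MateenUlhaq I think this party is more about seeing how many ways there are to solve the problem. It is not the most elegant solution, but I am satisfied that I learned something from my effort and from the other answers. Also, you have some great photos on your profile.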
   Sequence  Duration3  Value3
0      1008        981      82
