Python: get the maximum of a column over the rows that satisfy a condition


python python-3.x pandas


I have a DataFrame that looks like this:

| Age | Married | OwnsHouse |
| 23  | True    | False     |
| 35  | True    | True      |
| 14  | False   | False     |
| 27  | True    | True      |
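For anyone who wants to reproduce the examples in this thread, the table above can be built with a sketch like the following. The variable name `df` is the one used throughout the post; the column dtypes (integers and booleans) are an assumption based on how the table reads.

```python
import pandas as pd

# Sample data copied from the table above. The dtypes (int for Age,
# bool for the two flags) are assumed from the table's appearance.
df = pd.DataFrame({
    "Age": [23, 35, 14, 27],
    "Married": [True, True, False, True],
    "OwnsHouse": [False, True, False, True],
})

print(df)
```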

I want to find the maximum age among people who are married and own a house. The answer is 35. My first idea was:

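# Element-wise AND needs `&` with parenthesized comparisons; `and` raises a ValueError on Series
df_subset = df[(df['Married'] == True) & (df['OwnsHouse'] == True)]
max_age = df_subset['Age'].max()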

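However, the dataset is large (50MB), and I am worried that this will be computationally expensive because it passes over the dataset twice.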

My second idea was:

max_age = 0
# iterrows() yields (index, row) pairs; each row is a Series keyed by column name
for index, row in df.iterrows():
    if row['Married'] and row['OwnsHouse'] and row['Age'] > max_age:
        max_age = row['Age']

Is there a faster way to do this?

Your first approach is solid, but here is a simpler alternative:

df[df['Married'] & df['OwnsHouse']].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

Or, for just the age:

df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max()
# 35
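As a self-contained sketch of the one-liner above, the snippet below rebuilds the sample table from the question, names the mask explicitly, and reduces only the `Age` column. The intermediate names `mask` and `max_age` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Age": [23, 35, 14, 27],
    "Married": [True, True, False, True],
    "OwnsHouse": [False, True, False, True],
})

# Boolean mask: True only for rows where both flags are True.
mask = df["Married"] & df["OwnsHouse"]

# Select the Age column for the matching rows, then take the maximum.
max_age = df.loc[mask, "Age"].max()
print(max_age)  # 35
```

Selecting the column inside `.loc` avoids computing a maximum for the boolean columns, which are not needed here.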

If there are many boolean columns, I would suggest something that scales better:

df[df[['Married', 'OwnsHouse']].all(axis=1)].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

where,

df[['Married', 'OwnsHouse']].all(axis=1)

0    False
1     True
2    False
3     True
dtype: bool

which is the same as,

df['Married'] & df['OwnsHouse']

0    False
1     True
2    False
3     True
dtype: bool

However, rather than manually ANDing N boolean masks together, let .all do it for you.
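To make the equivalence concrete, here is a small sketch that folds the per-column masks together by hand and compares the result with `.all(axis=1)`. The list name `flag_cols` is hypothetical; with more boolean columns you would simply extend it.

```python
from functools import reduce
import operator

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Age": [23, 35, 14, 27],
    "Married": [True, True, False, True],
    "OwnsHouse": [False, True, False, True],
})

# Hypothetical list of boolean columns that must all be True.
flag_cols = ["Married", "OwnsHouse"]

# Manual version: AND the column masks together pairwise.
manual = reduce(operator.and_, (df[col] for col in flag_cols))

# Row-wise version: all() over the same columns.
rowwise = df[flag_cols].all(axis=1)

print((manual == rowwise).all())        # True
print(df.loc[rowwise, "Age"].max())     # 35
```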
query is another option:
df.query("Married and OwnsHouse")['Age'].max()
# 35

It saves you the intermediate step of building the mask yourself.
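A quick sketch, using the sample table from the question, that checks the `query` form against the explicit-mask form. The names `via_query` and `via_mask` are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Age": [23, 35, 14, 27],
    "Married": [True, True, False, True],
    "OwnsHouse": [False, True, False, True],
})

# The condition is written as a string over column names.
via_query = df.query("Married and OwnsHouse")["Age"].max()

# Equivalent explicit-mask form for comparison.
via_mask = df.loc[df["Married"] & df["OwnsHouse"], "Age"].max()

print(via_query, via_mask)  # 35 35
```

The condition is still evaluated internally; the difference is mainly that you do not have to write or name the mask.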

Your approach is fast enough, but if you want to micro-optimize, here are some more options using numpy:
# pandas <= 0.23
df[(df['Married'].values & df['OwnsHouse'].values)].max()
df[df[['Married', 'OwnsHouse']].values.all(axis=1)].max()
# pandas 0.24+
df[(df['Married'].to_numpy() & df['OwnsHouse'].to_numpy())].max()
df[df[['Married', 'OwnsHouse']].to_numpy().all(axis=1)].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

If you want even more numpy, you can do this:
df.loc[(
   df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()), 'Age'
].to_numpy().max()
# 35

Or, better still, drop pandas from the computation altogether:
df['Age'].to_numpy()[df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()].max()
# 35
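Whether the numpy variants are worth it is best settled by measuring on your own data. The sketch below is one way to do that with `timeit`; the synthetic DataFrame (row count, seed, value ranges) and the helper names `pandas_max` and `numpy_max` are assumptions for illustration, not the asker's dataset, and it needs pandas 0.24+ for `to_numpy()`. Timings vary by machine and version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: size, seed, and distributions are assumptions.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "Age": rng.integers(0, 100, size=n),
    "Married": rng.random(n) < 0.5,
    "OwnsHouse": rng.random(n) < 0.5,
})

def pandas_max():
    # Mask and reduction both done through pandas.
    return df.loc[df["Married"] & df["OwnsHouse"], "Age"].max()

def numpy_max():
    # Mask and reduction both done on the underlying numpy arrays.
    mask = df["Married"].to_numpy() & df["OwnsHouse"].to_numpy()
    return df["Age"].to_numpy()[mask].max()

# Both variants should agree before their timings are compared.
assert pandas_max() == numpy_max()

runs = 100
for fn in (pandas_max, numpy_max):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```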

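Your first idea is the way to go. 50MB is actually small. The second approach is not recommended. You can do df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max().

Have you checked the performance of the two approaches?

df.iterrows is an anti-pattern in pandas; it usually performs worse than any vectorized method or logical indexing. Declaring a 50MB intermediate result df_subset = df[(df['Married'] == True) & (df['OwnsHouse'] == True)] is unnecessary and wastes CPU and memory for no reason. As @QuangHoang showed, you should chain the calls.

Actually, len("df['Married']&df['OwnsHouse']") == 30 and len("df[['Married', 'OwnsHouse']].all(axis=1)") == 40 :D

@QuangHoang Fair enough, but what about df[col1] & df[col2] & ... df[col100]? :D

@0xPrateek Do you mean, did I time whether the numpy functions are faster than the equivalent pandas methods? Yes. The difference would probably be larger here, though.

@QuangHoang Thanks! I used a grenade launcher where I probably should have used a fly swatter, but I am fairly happy with the level of detail.

@cs95 Ah, my mistake. It is different from both R and C.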