Python: Adding a column with values based on conditions on other columns

python pandas

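I have the following dataframe:

and would like to add an extra column called "is_rich" that records whether a person is rich based on their salary. I found multiple ways to accomplish this: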

# method 1
df['is_rich_method1'] = np.where(df['salary']>=50, 'yes', 'no')

# method 2
df['is_rich_method2'] = ['yes' if x >= 50 else 'no' for x in df['salary']]

# method 3
df['is_rich_method3'] = 'no'
df.loc[df['salary'] >= 50, 'is_rich_method3'] = 'yes'  # >= 50, matching methods 1 and 2

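which results in:

However, I don't understand which way is preferred. Are all of these methods equally good, depending on your application?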
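For reference, here is a minimal sketch showing that the three approaches produce the same column when they use the same comparison (`>=`). The sample data, including the names and the column labels `m1` to `m3`, is hypothetical, since the original dataframe is not shown; the row with a salary of exactly 50 is there to exercise the boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the dataframe from the question is not shown.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "salary": [30, 50, 80]})

# Method 1: vectorized element-wise choice.
df["m1"] = np.where(df["salary"] >= 50, "yes", "no")

# Method 2: plain Python list comprehension.
df["m2"] = ["yes" if x >= 50 else "no" for x in df["salary"]]

# Method 3: default value, then overwrite the matching rows.
df["m3"] = "no"
df.loc[df["salary"] >= 50, "m3"] = "yes"

# All three columns should hold identical labels.
assert df["m1"].tolist() == df["m2"].tolist() == df["m3"].tolist()
print(df)
```

If method 3 used `> 50` instead, the row with a salary of exactly 50 would be labelled differently from the other two methods.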

Use the timings, Luke!

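Conclusion

List comprehensions perform best on small amounts of data because they incur very little overhead, even though they are not vectorized. OTOH, on larger data, loc and numpy.where perform better: vectorization wins.
Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data types of your columns. My suggestion is to test the various methods on your own data before settling on an option.
One sure takeaway from this, however, is that list comprehensions are very competitive: they are implemented in C and are highly optimized for performance.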

Here are the functions being timed:
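import numpy as np

def numpy_where(df):
    # Vectorized: choose 'yes' or 'no' element-wise.
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

def list_comp(df):
    # Pure-Python loop over the salary values.
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

def loc(df):
    # Default to 'no', then overwrite matching rows (>= 50, like the others).
    df = df.assign(is_rich='no')
    df.loc[df['salary'] >= 50, 'is_rich'] = 'yes'
    return df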
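
To try this on your own machine, a small timing harness along the following lines may help. This is only a sketch, not the benchmarking code behind the timings above: the `benchmark` helper, the data sizes, the random salary data, and the seed are all assumptions made for illustration. The three functions are repeated so the snippet runs on its own.

```python
import timeit

import numpy as np
import pandas as pd


def numpy_where(df):
    return df.assign(is_rich=np.where(df["salary"] >= 50, "yes", "no"))


def list_comp(df):
    return df.assign(is_rich=["yes" if x >= 50 else "no" for x in df["salary"]])


def loc(df):
    df = df.assign(is_rich="no")
    df.loc[df["salary"] >= 50, "is_rich"] = "yes"
    return df


def benchmark(sizes=(100, 10_000), repeats=5, number=10):
    """Hypothetical helper: return {size: {function name: best seconds per call}}."""
    rng = np.random.default_rng(0)  # fixed seed so runs use the same data
    results = {}
    for n in sizes:
        # Random integer salaries in [0, 100); assumed data, not the asker's.
        df = pd.DataFrame({"salary": rng.integers(0, 100, size=n)})
        results[n] = {
            f.__name__: min(
                timeit.repeat(lambda: f(df), number=number, repeat=repeats)
            ) / number
            for f in (numpy_where, list_comp, loc)
        }
    return results


if __name__ == "__main__":
    for n, timings in benchmark().items():
        print(n, timings)
```

Taking the minimum over several repeats reduces noise from other processes. Absolute numbers will vary with hardware and library versions, so compare the methods against each other at the data sizes you actually expect.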

You can test the speed on your side; I would recommend whichever is fastest :-)