Pandas: creating a helper variable when a specific order of operations is required

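I am working with a typical dataset (rows are observations, columns are variables).

I need to create a new variable from two original variables in the dataset. The logic has to respect the correct order of operations, i.e. if (a = 1 and b >= 10) or (a = 2 and b >= 20) ... I can do this easily in SAS (posted below), but I am translating some of my work to Python. My attempt is listed here. I also don't know how to handle the logic for NaN: if either original variable is NaN, the new variable should be NaN as well. I appreciate your help.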

def OLDER4GRADE(row):
    # the outer parentheses let the condition span several lines
    if ((row['H1GI20'] == 7 and row['AGE'] >= 14)
            or (row['H1GI20'] == 8 and row['AGE'] >= 15)
            or (row['H1GI20'] == 9 and row['AGE'] >= 16)
            or (row['H1GI20'] == 10 and row['AGE'] >= 17)
            or (row['H1GI20'] == 11 and row['AGE'] >= 18)
            or (row['H1GI20'] == 12 and row['AGE'] >= 19)):
        return 1
    else:
        return 0

data['OLDER4GRADE'] = data.apply(lambda row: OLDER4GRADE(row), axis=1)

Here is how it looks in SAS:

if H1GI20 EQ . or AGE1 eq . then OLDER4GRADE=.;
    else if (H1GI20=7 and AGE1 GE 14) or (H1GI20=8 and AGE1 GE 15) or (H1GI20=9 and AGE1 GE 16) or 
            (H1GI20=10 and AGE1 GE 17) or (H1GI20=11 and AGE1 GE 18) or (H1GI20=12 and AGE1 GE 19) 
        then OLDER4GRADE=1;
    else OLDER4GRADE=0;

Let's fix your code first:

import numpy as np

def OLDER4GRADE(row):
    # handle NaN first, as your SAS code does
    if np.isnan(row['H1GI20']) or np.isnan(row['AGE']):
        return np.nan

    # the outer parentheses let the condition span several lines
    if ((row['H1GI20'] == 7 and row['AGE'] >= 14)
            or (row['H1GI20'] == 8 and row['AGE'] >= 15)
            or (row['H1GI20'] == 9 and row['AGE'] >= 16)
            or (row['H1GI20'] == 10 and row['AGE'] >= 17)
            or (row['H1GI20'] == 11 and row['AGE'] >= 18)
            or (row['H1GI20'] == 12 and row['AGE'] >= 19)):
        return 1
    else:
        return 0

# passing the function directly is enough; no `lambda` needed
data['OLDER4GRADE'] = data.apply(OLDER4GRADE, axis=1)
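
As a quick sanity check of the row-wise version, here is a minimal sketch on made-up data. The sample values and the helper name `older_for_grade` are invented for illustration and are not from your dataset, and `pd.isna` is used in place of `np.isnan` for the missing-value check. It also shows that Python evaluates `and` before `or`, so the parentheses around each pair help readability but are not required for correctness.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: grade (H1GI20) and AGE, with some missing values
sample = pd.DataFrame({
    'H1GI20': [7, 7, 8, 12, np.nan, 9],
    'AGE':    [14, 13, 15, 18, 16, np.nan],
})

def older_for_grade(row):
    # A missing value in either column gives a missing result
    if pd.isna(row['H1GI20']) or pd.isna(row['AGE']):
        return np.nan
    # `and` binds tighter than `or`, so the pairs need no inner parentheses;
    # the outer parentheses only allow the expression to span several lines
    flag = (row['H1GI20'] == 7 and row['AGE'] >= 14
            or row['H1GI20'] == 8 and row['AGE'] >= 15
            or row['H1GI20'] == 9 and row['AGE'] >= 16
            or row['H1GI20'] == 10 and row['AGE'] >= 17
            or row['H1GI20'] == 11 and row['AGE'] >= 18
            or row['H1GI20'] == 12 and row['AGE'] >= 19)
    return int(flag)

row_result = sample.apply(older_for_grade, axis=1)
print(row_result)
```

The last two rows come out as NaN because one of the inputs is missing, which mirrors the first branch of the SAS code.
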
Now, with pandas it is recommended to avoid `apply` along rows whenever possible. Your logic can be translated to:

# rows with `nan` in either column
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)

# the threshold for each category
thresholds = {
    7: 14,
    8: 15,
    9: 16,
   10: 17,
   11: 18,
   12: 19
}

# use `map` to turn each `H1GI20` value into its threshold,
# then check whether `AGE` is at or above that threshold
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)

data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))
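
To see the vectorized approach end to end, here is a self-contained sketch on a small made-up DataFrame. The sample values are assumptions for illustration only. It includes missing values and a grade (6) that has no threshold, to show how each case is treated.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including NaN and a grade with no threshold (6)
data = pd.DataFrame({
    'H1GI20': [7, 7, 10, 12, 6, np.nan, 9],
    'AGE':    [14, 13, 18, 19, 20, 16, np.nan],
})

thresholds = {7: 14, 8: 15, 9: 16, 10: 17, 11: 18, 12: 19}

# Rows with NaN in either column
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)

# Grades without a threshold map to NaN, and `AGE >= NaN` evaluates to False
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)

# NaN where an input is missing, otherwise 1 or 0 (stored as floats)
data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))
print(data)
```

Because NaN cannot be stored in a plain integer column, the result is a float column holding 1.0, 0.0 and NaN. A grade outside 7 to 12 with a valid age gets 0, which matches the final `else` branch of the SAS code.
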

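That worked! Thank you very much. I had already imported numpy, but I thought the code for the new variable was part of the pandas library. As a SAS programmer, I am still getting used to the concept of libraries. I really appreciate your help.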