Pandas: creating a derived variable when a specific order of operations is required
I am working with a typical dataset (rows are observations, columns are variables). I need to create a new variable based on two original variables in the dataset. The logic has to respect the correct order of operations, i.e. if (a=1 and b>=10) or (a=2 and b>=20) ... I can do this easily in SAS (posted below), but I am translating some of my work to Python. My attempt is listed here. I also don't know how to handle the NaN logic: if either original variable is NaN, the new variable should be NaN as well. I appreciate your help.
def OLDER4GRADE (row) :
    if (row['H1GI20'] == 7 and row['AGE'] >= 14)
    or (row['H1GI20'] == 8 and row['AGE'] >= 15)
    or (row['H1GI20'] == 9 and row['AGE'] >= 16)
    or (row['H1GI20'] == 10 and row['AGE'] >= 17)
    or (row['H1GI20'] == 11 and row['AGE'] >= 18)
    or (row['H1GI20'] == 12 and row['AGE'] >= 19:
        return 1
    else :
        return 0

data['OLDER4GRADE'] = data.apply(lambda row: OLDER4GRADE (row), axis = 1)
Here is how it looks in SAS:
if H1GI20 EQ . or AGE1 eq . then OLDER4GRADE=.;
else if (H1GI20=7 and AGE1 GE 14) or (H1GI20=8 and AGE1 GE 15) or (H1GI20=9 and AGE1 GE 16) or
(H1GI20=10 and AGE1 GE 17) or (H1GI20=11 and AGE1 GE 18) or (H1GI20=12 and AGE1 GE 19)
then OLDER4GRADE=1;
else OLDER4GRADE=0;
Let's first fix your code:
import numpy as np

def OLDER4GRADE(row):
    # handle NaN first, as the SAS code does
    if np.isnan(row['H1GI20']) or np.isnan(row['AGE']):
        return np.nan
    # the multi-line condition is wrapped in one outer pair of parentheses
    # so that Python parses it as a single expression
    if ((row['H1GI20'] == 7 and row['AGE'] >= 14)
            or (row['H1GI20'] == 8 and row['AGE'] >= 15)
            or (row['H1GI20'] == 9 and row['AGE'] >= 16)
            or (row['H1GI20'] == 10 and row['AGE'] >= 17)
            or (row['H1GI20'] == 11 and row['AGE'] >= 18)
            or (row['H1GI20'] == 12 and row['AGE'] >= 19)):
        return 1
    else:
        return 0

# passing the function to apply directly is enough; no lambda is needed
data['OLDER4GRADE'] = data.apply(OLDER4GRADE, axis=1)
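To see why the explicit NaN check matters, here is a minimal sketch. The two helper functions and their names are made up for illustration, and the condition is shortened to two grades. It relies on the fact that any comparison with NaN evaluates to False, so without the check a missing value would silently become 0 instead of NaN.

```python
import numpy as np

def older_no_nan_check(grade, age):
    # hypothetical helper: same shape of condition, shortened to two grades;
    # `and` binds tighter than `or`, the parentheses are for readability
    if (grade == 7 and age >= 14) or (grade == 8 and age >= 15):
        return 1
    return 0

def older_with_nan_check(grade, age):
    # hypothetical helper: return NaN as soon as either input is missing
    if np.isnan(grade) or np.isnan(age):
        return np.nan
    return older_no_nan_check(grade, age)

# every comparison against NaN is False, so the unchecked version returns 0
without_check = older_no_nan_check(np.nan, 15)
with_check = older_with_nan_check(np.nan, 15)
print(without_check)  # 0
print(with_check)     # nan
```

This mirrors the SAS code, which tests for missing values before evaluating the main condition.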
Now, with pandas it is recommended to avoid applying a function along rows whenever possible. Your logic can be rewritten as:
# rows with NaN in either column
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)

# the age threshold for each grade
thresholds = {
    7: 14,
    8: 15,
    9: 16,
    10: 17,
    11: 18,
    12: 19
}

# use `map` to turn `H1GI20` into its threshold, then check AGE >= threshold
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)
data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))
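The vectorized approach can be tried on a small frame. This is only a sketch: the sample values are invented and do not come from the original dataset. It also shows the same logic written with explicit boolean masks, which is closer to the SAS condition. Note that in Python `&` and `|` bind more tightly than `==` and `>=`, so each comparison needs its own parentheses.

```python
import numpy as np
import pandas as pd

# hypothetical sample data, not taken from the original dataset
data = pd.DataFrame({
    'H1GI20': [7, 7, 8, 12, np.nan, 9, 6],
    'AGE':    [14, 13, 16, 19, 15, np.nan, 20],
})

# rows with NaN in either column
invalid = data[['H1GI20', 'AGE']].isna().any(axis=1)
thresholds = {7: 14, 8: 15, 9: 16, 10: 17, 11: 18, 12: 19}

# grades missing from the dict map to NaN, and AGE >= NaN is False
above_thresh = data['AGE'] >= data['H1GI20'].map(thresholds)
data['OLDER4GRADE'] = np.where(invalid, np.nan, above_thresh.astype(int))

# the same logic with explicit masks; every comparison is parenthesized
g, a = data['H1GI20'], data['AGE']
mask = (
    ((g == 7) & (a >= 14))
    | ((g == 8) & (a >= 15))
    | ((g == 9) & (a >= 16))
    | ((g == 10) & (a >= 17))
    | ((g == 11) & (a >= 18))
    | ((g == 12) & (a >= 19))
)
explicit = np.where(invalid, np.nan, mask.astype(int))

print(data)
```

The dictionary version is shorter and easier to extend when more grades are added; the explicit masks are easier to compare line by line with the SAS condition.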
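It worked! Thank you very much. I had already imported numpy, but I thought the code for the new variable was part of the pandas library. As a SAS programmer, I am still getting used to the concept of libraries. I really appreciate your help.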