Python: change DataFrame column values based on a condition
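I have a large DataFrame. The data used as the example here is "education_val.csv", linked on GitHub.

I want to replace the values in the Education column as follows:

If an ID has the value "higher education" in the Education column in some year, then all later years of that ID should also have "higher education" in the Education column.

If an ID has "intermediate qualifications" in some year, then all later years of that ID should have "intermediate qualifications" in the Education column. However, if "higher education" appears for that ID in any subsequent year, "higher education" replaces "intermediate qualifications" from that year onward, regardless of which other qualifications appear afterwards.

For example, ID 22445 has "higher education" in 1991, so every later Education value for 22445 should be replaced with "higher education", up to 2017. Similarly, ID 1587125 has "intermediate qualifications" in 1991 and "higher education" in 1993. From 1993 onward, every value in the Education column for 1587125 should be "higher education".

There are 12057 unique IDs in the data, and the Year column spans 1991 to 2017. How can I change the Education values for all 12057 people according to the conditions above? I am not sure how to do this in a uniform way for every unique ID. The sample data is in the GitHub link above. Many thanks.

You can use the following approach: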
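import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

# Ordered categorical dtype: entries later in the list rank higher
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
# Strip stray whitespace, then convert the labels to the ordered categorical
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
# Running maximum of the category codes per ID, mapped back to the labels
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
    .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])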
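It is broken down explicitly so you can see the data manipulations I am using:
Create an ordered categorical dtype for education (eddtype).
Next, change the dtype of the Education column to use that categorical dtype (EducationCat).
Use the codes of the categorical to perform the cummax calculation.
Use indexing to return the category defined by the cummax calculation (EduMax).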
Output:
df[df['ID'] == 1587125]
        ID       Year  Education                    EducationCat                 EduMax
    18  1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
 12075  1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
 24132  1587125  1993  higher education             higher education             higher education
 36189  1587125  1994  higher education             higher education             higher education
 48246  1587125  1995  higher education             higher education             higher education
 60303  1587125  1996  higher education             higher education             higher education
 72360  1587125  1997  higher education             higher education             higher education
 84417  1587125  1998  higher education             higher education             higher education
 96474  1587125  1999  higher education             higher education             higher education
108531  1587125  2000  higher education             higher education             higher education
120588  1587125  2001  higher education             higher education             higher education
132645  1587125  2002  higher education             higher education             higher education
144702  1587125  2003  higher education             higher education             higher education
156759  1587125  2004  Other                        Other                        higher education
168816  1587125  2005  No qualifications            No qualifications            higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications  higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications  higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications  higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications  higher education
229101  1587125  2011  higher education             higher education             higher education
241158  1587125  2012  higher education             higher education             higher education
253215  1587125  2013  higher education             higher education             higher education
265272  1587125  2014  higher education             higher education             higher education
277329  1587125  2015  higher education             higher education             higher education
289386  1587125  2016  higher education             higher education             higher education
301443  1587125  2017  higher education             higher education             higher education
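The four steps of this answer can also be tried on a small made-up frame without downloading the file. The sketch below is a variation, not the answer's exact code: it calls `cummax` on the grouped codes directly and rebuilds the labels with `pd.Categorical.from_codes`, so a missing value (code -1) is not mistaken for the last category. The names `toy`, `levels`, `codes` and `running`, the rows themselves, and the choice to carry the running level across a missing year are all assumptions made for illustration.

```python
import pandas as pd

# Invented panel: two people, a few years each (not taken from education_val.csv).
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 2, 2, 2],
    'Year': [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    'Education': ['intermediate qualifications', 'higher education', 'Other',
                  'No qualifications', 'Other', None, 'No qualifications'],
})

# Ordered categorical: entries later in the list rank higher.
levels = pd.CategoricalDtype(
    ['No qualifications', 'Other', 'intermediate qualifications', 'higher education'],
    ordered=True)

toy = toy.sort_values(['ID', 'Year'])
# Integer rank of each label; a missing value gets the code -1.
codes = toy['Education'].astype(levels).cat.codes
# Running maximum of the rank within each person.
running = codes.groupby(toy['ID']).cummax()
# from_codes maps -1 back to NaN instead of indexing the last category.
toy['EduMax'] = pd.Categorical.from_codes(running.to_numpy(), dtype=levels)

print(toy)
```

With this variation, a person whose first recorded year is missing keeps a missing `EduMax` for that year, whereas indexing `categories` with -1 would return the highest label.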
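Education levels are clearly ordered. Your problem can be restated as a running-maximum problem: up to a given year, what is the highest education level a person has reached? Try this: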
import pandas as pd

# Load the sample data into the frame the rest of the snippet calls edu
edu = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}
# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)
# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()
# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})
edu['Education'] = tmp
Test:
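One quick way to check the expanding-max idea is a small made-up frame whose rows are deliberately out of year order. The IDs, years and labels below are invented for illustration and are not taken from education_val.csv. The `rank` and `Result` columns and the `astype(int)` step are assumptions of this sketch: the conversion turns the float ranks that `expanding().max()` returns back into integers before the lookup, and it would fail if a rank were missing.

```python
import pandas as pd

order = ['No qualifications', 'Other', 'intermediate qualifications', 'higher education']
rank = {label: i for i, label in enumerate(order)}

# Invented rows, shuffled so that sorting by Year actually matters.
toy = pd.DataFrame({
    'ID':   [7, 7, 7, 8, 8, 7],
    'Year': [1993, 1991, 1992, 1991, 1992, 1994],
    'Education': ['Other', 'intermediate qualifications', 'higher education',
                  'No qualifications', 'Other', 'No qualifications'],
})

# Label -> rank, then the expanding maximum per person in year order.
toy['rank'] = toy['Education'].map(rank)
tmp = toy.sort_values('Year').groupby('ID')['rank'].expanding().max()
# Drop the ID level so the result aligns with the original row index.
tmp = tmp.droplevel(0).astype(int)
# Rank -> label; assignment aligns on the index, so row order is preserved.
toy['Result'] = tmp.map(dict(enumerate(order)))

print(toy.sort_values(['ID', 'Year']))
```

Person 7 reaches "higher education" in 1992 and keeps it in 1993 and 1994 even though lower labels were recorded in those years. Person 8 only moves from "No qualifications" to "Other".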
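You can iterate over the IDs and then over the years. The DataFrame is in chronological order, so if a person has "higher education" or "intermediate qualifications" in a cell, you can remember that and apply it to the subsequent cells: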
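import pandas as pd

edu = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
# make the chronological order explicit, grouped by person
edu = edu.sort_values(['ID', 'Year'])

new_values = []
for _, rows in edu.groupby('ID', sort=False):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False
    for value in rows['Education']:
        # remember which statuses have appeared so far for this ID
        if value == 'higher education':
            higher_ed = True
        elif value == 'intermediate qualifications':
            inter_qual = True
        # higher education takes precedence over intermediate qualifications
        if higher_ed:
            value = 'higher education'
        elif inter_qual:
            value = 'intermediate qualifications'
        new_values.append(value)
# groups were visited in row order, so the list lines up with the sorted frame
edu['Education'] = new_values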
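It does not matter that a cell may be considered for both statuses: if a person has both "intermediate qualifications" and "higher education", we only need to make sure that "higher education" is the value that wins.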
I usually do not recommend for loops for processing DataFrames, but here each cell value may depend on the values above it, and the DataFrame is not so large that looping becomes infeasible.
I think yours might be faster. It may take some timings to find out.
I just did a quick run, and it is actually slightly slower than yours. The lambda inside transform is very cool. I had not thought of that.
Thanks for running those timings, I appreciate your comments. It just plays to my strength of making full use of the pandas library, sometimes to the detriment of using numpy or native Python methods.
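The comments mention timing the answers against each other without showing how. A minimal sketch of such a comparison follows, using synthetic data of a similar shape (one row per ID and year from 1991 to 2017). The function names `with_cummax` and `with_expanding`, the 200 random IDs and the seed are all assumptions made for illustration. No particular timing result is implied, since it depends on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

order = ['No qualifications', 'Other', 'intermediate qualifications', 'higher education']

# Synthetic panel: 200 invented IDs, one random label per year.
rng = np.random.default_rng(0)
years = np.arange(1991, 2018)
n_ids = 200
synthetic = pd.DataFrame({
    'ID': np.repeat(np.arange(n_ids), len(years)),
    'Year': np.tile(years, n_ids),
})
synthetic['Education'] = rng.choice(order, size=len(synthetic))


def with_cummax(df):
    """Running maximum of ordered categorical codes per ID."""
    dtype = pd.CategoricalDtype(order, ordered=True)
    codes = df['Education'].astype(dtype).cat.codes
    running = codes.groupby(df['ID']).cummax()
    labels = np.array(order, dtype=object)[running.to_numpy()]
    return pd.Series(labels, index=df.index)


def with_expanding(df):
    """Expanding maximum of mapped ranks per ID."""
    rank = {label: i for i, label in enumerate(order)}
    ranks = df['Education'].map(rank)
    best = ranks.groupby(df['ID']).expanding().max().droplevel(0)
    return best.astype(int).map(dict(enumerate(order))).reindex(df.index)


a = with_cummax(synthetic)
b = with_expanding(synthetic)
# Both functions should agree row by row before their speed is compared.
same = bool((a.to_numpy() == b.to_numpy()).all())
print('identical results:', same)

t_cummax = timeit.timeit(lambda: with_cummax(synthetic), number=3)
t_expanding = timeit.timeit(lambda: with_expanding(synthetic), number=3)
print(f'cummax: {t_cummax:.3f}s  expanding: {t_expanding:.3f}s (3 runs each)')
```

Checking that the two results agree first keeps the timing comparison honest: a faster function that produces different labels would not be a fair candidate.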