Python: changing DataFrame values in a column based on a condition

python pandas

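I have a large DataFrame. The sample data is available as "education_val.csv" (the GitHub link is used in the answers below).

The Education column takes values such as "No qualifications", "Other", "intermediate qualifications" and "higher education".

I want to replace the values in the Education column as follows:

If an ID has the value "higher education" in the Education column in some year, then all future years for that ID should also have "higher education" in the Education column.

If an ID has "intermediate qualifications" in some year, then all future years for that ID should have "intermediate qualifications" in the corresponding Education rows. However, if "higher education" appears for this ID in any later year, then "higher education" replaces "intermediate qualifications" from that year onward, regardless of which other values appear.

For example, ID 22445 has the value "higher education" in 1991, so every subsequent Education value for 22445 should be replaced with "higher education" for all later years up to 2017.

Similarly, ID 1587125 has "intermediate qualifications" in 1991 and "higher education" in 1993. From 1993 onward, all subsequent Education values for 1587125 should be "higher education".

There are 12057 unique IDs in the data, and the Year column spans 1991 to 2017. How can I change the Education values for all 12057 IDs according to the conditions above? I am not sure how to do this uniformly for every unique ID. The sample data is in the GitHub link mentioned above. Many thanks.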

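You can use the following approach:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

# Ordered categorical dtype: lowest level first, highest level last
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
# Strip stray whitespace, then convert the labels to the ordered categorical
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

# Per ID, in year order: cumulative max of the category codes, mapped back to labels
df['EduMax'] = (df.sort_values('Year').groupby('ID')['EducationCat']
                  .transform(lambda x: eddtype.categories[x.cat.codes.cummax()]))

It is broken down explicitly so you can see the data manipulations I am using:

1. Create the ordered categorical dtype, eddtype.
2. Convert the Education column to that categorical dtype (EducationCat).
3. Run the cummax calculation on the categorical codes.
4. Index into the categories with the cummax result to get the labels back (EduMax).

Output: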

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education
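To see the same idea without downloading the file, here is a minimal sketch on a made-up frame. The `toy` data and the use of `pd.Categorical.from_codes` are my own assumptions for illustration and are not part of the answer above, which indexes `eddtype.categories` inside `transform` instead.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset): two IDs over four years.
toy = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1994, 1991, 1992, 1993, 1994],
    "Education": [
        "intermediate qualifications", "Other",
        "higher education", "No qualifications",
        "No qualifications", "intermediate qualifications",
        "Other", "Other",
    ],
}).sort_values(["ID", "Year"])

levels = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
eddtype = pd.CategoricalDtype(levels, ordered=True)

# Integer code of each label: 0 = lowest level, 3 = highest level.
codes = toy["Education"].astype(eddtype).cat.codes

# Running maximum of the codes within each ID, in year order.
running = codes.groupby(toy["ID"]).cummax()

# Turn the running codes back into labels of the same ordered dtype.
toy["EduMax"] = pd.Categorical.from_codes(running.to_numpy(), dtype=eddtype)
print(toy)
```

Labels that are not in `levels` get the code -1 and come back as missing values, so it may be worth checking `toy["Education"].unique()` on the real data first.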

Education levels are clearly ordered. Your problem can be restated as a running-maximum problem: up to a given year, what is the highest education level a person has reached?

Try this:

import pandas as pd

edu = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank (strip stray whitespace first so every label matches)
edu['Education'] = edu['Education'].str.strip().map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp
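One thing to be aware of: a running maximum over all four levels also carries "Other" forward over a later "No qualifications", which the question does not explicitly ask for. If only "intermediate qualifications" and "higher education" should persist, a variant along these lines might work. This is a sketch under that reading of the question; the `toy` frame and the names `sticky_rank` and `Education_filled` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for a single ID (not the real dataset).
toy = pd.DataFrame({
    "ID":   [7, 7, 7, 7, 7],
    "Year": [1991, 1992, 1993, 1994, 1995],
    "Education": ["Other", "No qualifications",
                  "intermediate qualifications", "Other", "higher education"],
}).sort_values(["ID", "Year"])

# Only these two labels persist; every other label gets rank 0.
sticky_rank = {"intermediate qualifications": 1, "higher education": 2}
rank_to_label = {rank: label for label, rank in sticky_rank.items()}

rank = toy["Education"].map(sticky_rank).fillna(0).astype(int)

# Highest persistent rank seen so far for each ID, in year order.
running = rank.groupby(toy["ID"]).cummax()

# Overwrite only where a persistent level has been seen; otherwise keep the row as is.
toy["Education_filled"] = running.map(rank_to_label).fillna(toy["Education"])
print(toy)
```

Here the 1991 and 1992 rows keep their original labels, because no persistent level has appeared yet for that ID.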

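You can iterate over the IDs and then over the years. The rows are processed in chronological order, so once a person has "higher education" or "intermediate qualifications" in a row, you can remember that and apply it to the following rows:

import pandas as pd

edu = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
edu['Education'] = edu['Education'].str.strip()
edu = edu.sort_values(['ID', 'Year'])

for _, rows in edu.groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    # iterrows() yields copies, so write the result back with edu.at
    for idx, row in rows.iterrows():
        # check for intermediate qualifications
        if row['Education'] == 'intermediate qualifications':
            inter_qual = True
        if inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'

        # check for higher education (applied last so it takes precedence)
        if row['Education'] == 'higher education':
            higher_ed = True
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'

It does not matter that we may overwrite a row more than once. If a person has both "intermediate qualifications" and "higher education", we only need to make sure that "higher education" is set last.

I generally do not recommend for loops for processing a DataFrame, but here each cell can depend on the value above it, and the DataFrame is not so large that looping becomes infeasible.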

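Comments:

I think yours may be faster. It might need some timings to see.

I just ran it quickly, and it is actually slightly slower than yours. The lambda inside transform is pretty cool. I had not thought of that.

Thanks for running those timings, I appreciate the comment. It is just me playing to my strength of using the pandas library to the fullest, sometimes at the expense of numpy or native Python methods.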
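The comments mention timings but do not show them. Below is a rough sketch of how such a comparison could be set up with `timeit` on synthetic data. The frame size, the three function names and the simplified variants are assumptions of mine: they follow the spirit of the answers (expanding max, cumulative max, plain loop) rather than reproducing them exactly, and the numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

LEVELS = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
RANK = {label: i for i, label in enumerate(LEVELS)}

# Synthetic stand-in for the real file: 200 made-up IDs x 27 years (1991-2017).
rng = np.random.default_rng(0)
n_ids = 200
synthetic = pd.DataFrame({
    "ID": np.repeat(np.arange(n_ids), 27),
    "Year": np.tile(np.arange(1991, 2018), n_ids),
    "Education": rng.choice(LEVELS, size=n_ids * 27),
}).sort_values(["ID", "Year"])
synthetic["Rank"] = synthetic["Education"].map(RANK)


def with_expanding(df):
    """Running max via groupby().expanding().max()."""
    out = df.groupby("ID")["Rank"].expanding().max()
    # Drop the ID level so the original row index remains.
    return out.droplevel(0).astype(int)


def with_cummax(df):
    """Running max via groupby().cummax()."""
    return df.groupby("ID")["Rank"].cummax()


def with_loop(df):
    """Running max with plain Python loops over each ID's rows."""
    out = []
    for _, ranks in df.groupby("ID")["Rank"]:
        best = -1
        for r in ranks:
            best = max(best, r)
            out.append(best)
    return pd.Series(out, index=df.index)


# Sanity check: all three variants agree before timing them.
assert with_expanding(synthetic).tolist() == with_cummax(synthetic).tolist()
assert with_cummax(synthetic).tolist() == with_loop(synthetic).tolist()

for fn in (with_expanding, with_cummax, with_loop):
    seconds = timeit.timeit(lambda: fn(synthetic), number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 runs")
```

The ranks can be turned back into labels afterwards with `.map(dict(enumerate(LEVELS)))`.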
