Python: adding missing rows with pandas

This question is related to another question, but it is a bit more complicated.

I have a table like this:

ID    DEGREE    TERM    STATUS GRADTERM
1     Bachelors 20111   1
1     Bachelors 20116   1
2     Bachelors 20126   1
2     Bachelors 20131   1
2     Bachelors 20141   1
3     Bachelors 20106   1
3     Bachelors 20111   1       20116
3     Masters   20116   1
3     Masters   20121   1
3     Masters   20131   1       20136

I want to turn it into a table that has a row for every term (when run for term 20151).

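In both tables, STATUS is 0 for not enrolled, 1 for enrolled, and 2 for graduated. The TERM field is a year followed by 1 or 6, meaning spring or fall.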
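The term encoding described above can be sketched in plain Python. This is only an illustration of the "year followed by 1 or 6" convention; the helper names `split_term`, `next_term`, and `terms_between` are made up here and do not come from the question.

```python
def split_term(term):
    """Split a term code such as 20151 into (year, season_digit)."""
    year, season = divmod(int(term), 10)
    if season not in (1, 6):
        raise ValueError(f"unexpected term code: {term}")
    return year, season


def next_term(term):
    """Return the term after `term`: spring (1) -> fall (6) -> next spring."""
    year, season = split_term(term)
    return year * 10 + 6 if season == 1 else (year + 1) * 10 + 1


def terms_between(first, last):
    """List every term code from `first` to `last`, inclusive."""
    terms = []
    term = int(first)
    while term <= int(last):
        terms.append(term)
        term = next_term(term)
    return terms


# Terms for a person first seen in fall 2012, up to spring 2015
print(terms_between(20126, 20151))
```

For a person whose first record is 20126, this lists the six terms that should exist by 20151.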

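Between each person's first record and the current term (20151 in this example), a record should be added for every missing term. Each added record gets STATUS 0, unless the last existing record has STATUS 2, in which case the 2 is carried forward. In other words, a person is either enrolled (STATUS = 1) or not enrolled (STATUS = 0 or 2).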


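I'm using pandas in Python, but I'm new to Python. I've been trying to figure out how DataFrame indexing works, but at this point it is a complete mystery to me. Any guidance would be much appreciated.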

You can do it like this:

import io

import numpy as np  # needed by build_year_term_range below
import pandas as pd
# python 3.4 used

# just try to replicate your data. Use your own csv file instead
# =========================================================
csv = 'ID,DEGREE,TERM,STATUS,GRADTERM\n1,Bachelors,20111,1,\n1,Bachelors,20116,1,\n2,Bachelors,20126,1,\n2,Bachelors,20131,1,\n2,Bachelors,20141,1,\n3,Bachelors,20106,1,\n3,Bachelors,20111,1,20116.0\n3,Masters,20116,1,\n3,Masters,20121,1,\n3,Masters,20131,1,20136.0\n'

df = pd.read_csv(io.StringIO(csv)).set_index('ID')
print(df)

       DEGREE   TERM  STATUS  GRADTERM
ID                                    
1   Bachelors  20111       1       NaN
1   Bachelors  20116       1       NaN
2   Bachelors  20126       1       NaN
2   Bachelors  20131       1       NaN
2   Bachelors  20141       1       NaN
3   Bachelors  20106       1       NaN
3   Bachelors  20111       1     20116
3     Masters  20116       1       NaN
3     Masters  20121       1       NaN
3     Masters  20131       1     20136


# two helper functions
# =========================================================

def build_year_term_range(start_term, current_term):
    # assumes start_term and current_term are strings in a format like '20151'
    start_year = int(start_term[:4])  # first four characters are the year
    start_term = int(start_term[-1])  # last character is the term (1 or 6)
    current_year = int(current_term[:4])
    current_term = int(current_term[-1])
    # build a range
    year_rng = np.repeat(np.arange(start_year, current_year+1), 2)
    term_rng = [1, 6] * int(len(year_rng) / 2)
    year_term_rng = [int(str(year) + str(term)) for year, term in zip(year_rng, term_rng)]
    # check whether need to trim the first and last
    if start_term == 6:  # remove the first
        year_term_rng = year_term_rng[1:]
    if current_term == 1:  # remove the last
        year_term_rng = year_term_rng[:-1]

    return year_term_rng

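def my_apply_func(group, current_year_term='20151'):
    # current_year_term is a string such as '20151'; the default matches the example
    # first term on record for this group (assumes rows are sorted by TERM)
    start_year_term = str(group['TERM'].iloc[0])  # e.g. '20111'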
    year_term_rng = build_year_term_range(start_year_term, current_year_term)
    # manipulate the group
    group = group.reset_index().set_index('TERM')
    # use reindex to populate missing rows
    group = group.reindex(year_term_rng)
    # forward-fill ID/DEGREE from the previous row
    group[['ID', 'DEGREE']] = group[['ID', 'DEGREE']].ffill()
    # fillna by 0 not enrolled (for now)
    group['STATUS'] = group['STATUS'].fillna(0)
    # shift GRADTERM one slot forward, because GRADTERM names the term after the row it is recorded on
    group['GRADTERM'] = group['GRADTERM'].shift(1)
    # flag rows where a graduation is recorded, convert to int, use cumsum to carry
    # that non-zero entry forward, then convert back to boolean
    # (this may look non-trivial at first :)
    group.loc[group['GRADTERM'].notnull().astype(int).cumsum().astype(bool), 'STATUS'] = 2
    # return only the relevant column, as integers (reindex turned it into floats)
    return group['STATUS'].astype(int)


# start processing
# ============================================================
# move ID from index to a normal column
df = df.reset_index()
# specify the current year term as a string
current_year_term = '20151'
# one group per person and degree; pass the current term through to the helper
result = df.groupby(['ID', 'DEGREE']).apply(my_apply_func, current_year_term=current_year_term).reset_index()
print(result)

    ID     DEGREE   TERM  STATUS
0    1  Bachelors  20111       1
1    1  Bachelors  20116       1
2    1  Bachelors  20121       0
3    1  Bachelors  20126       0
4    1  Bachelors  20131       0
5    1  Bachelors  20136       0
6    1  Bachelors  20141       0
7    1  Bachelors  20146       0
8    1  Bachelors  20151       0
9    2  Bachelors  20126       1
10   2  Bachelors  20131       1
11   2  Bachelors  20136       0
12   2  Bachelors  20141       1
13   2  Bachelors  20146       0
14   2  Bachelors  20151       0
15   3  Bachelors  20106       1
16   3  Bachelors  20111       1
17   3  Bachelors  20116       2
18   3  Bachelors  20121       2
19   3  Bachelors  20126       2
20   3  Bachelors  20131       2
21   3  Bachelors  20136       2
22   3  Bachelors  20141       2
23   3  Bachelors  20146       2
24   3  Bachelors  20151       2
25   3    Masters  20116       1
26   3    Masters  20121       1
27   3    Masters  20126       0
28   3    Masters  20131       1
29   3    Masters  20136       2
30   3    Masters  20141       2
31   3    Masters  20146       2
32   3    Masters  20151       2
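
The same fill can also be written as an explicit loop over the groups instead of `groupby(...).apply`, which may be easier to follow and avoids depending on how `apply` treats the grouping columns in a given pandas version. This is a minimal sketch, not the method used above. It marks graduation from the GRADTERM value itself rather than by shifting, and it assumes TERM is unique within each (ID, DEGREE) group. The names `term_range` and `fill_terms` are made up for this example.

```python
import io

import pandas as pd

csv = (
    "ID,DEGREE,TERM,STATUS,GRADTERM\n"
    "1,Bachelors,20111,1,\n1,Bachelors,20116,1,\n"
    "2,Bachelors,20126,1,\n2,Bachelors,20131,1,\n2,Bachelors,20141,1,\n"
    "3,Bachelors,20106,1,\n3,Bachelors,20111,1,20116.0\n"
    "3,Masters,20116,1,\n3,Masters,20121,1,\n3,Masters,20131,1,20136.0\n"
)
df = pd.read_csv(io.StringIO(csv))


def term_range(first, last):
    """Every term code from first to last inclusive (spring=1, fall=6)."""
    terms, term = [], int(first)
    while term <= int(last):
        terms.append(term)
        year, season = divmod(term, 10)
        term = year * 10 + 6 if season == 1 else (year + 1) * 10 + 1
    return terms


def fill_terms(df, current_term):
    """Return one row per (ID, DEGREE, TERM) up to current_term."""
    pieces = []
    for (student_id, degree), group in df.groupby(["ID", "DEGREE"]):
        group = group.sort_values("TERM")
        terms = term_range(group["TERM"].iloc[0], current_term)
        # missing terms become NaN after reindex, then 0 (not enrolled)
        status = group.set_index("TERM")["STATUS"].reindex(terms).fillna(0).astype(int)
        # from the graduation term onward, the status is 2 (graduated)
        grad = group["GRADTERM"].dropna()
        if not grad.empty:
            status[status.index >= int(grad.iloc[0])] = 2
        pieces.append(pd.DataFrame({
            "ID": student_id,
            "DEGREE": degree,
            "TERM": terms,
            "STATUS": status.to_numpy(),
        }))
    return pd.concat(pieces, ignore_index=True)


result = fill_terms(df, 20151)
print(result)
```

On the sample data this gives one row for every person, degree, and term from the first record through 20151.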

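It worked! Thank you. Now I want to carry over a new column from the original table (e.g. COHORT_TERM). I thought I just needed to add one more line to my_apply_func (something like
group['COHORT'] = group['COHORT'].fillna(method='ffill')
), but that does not work. How can I do this?

How do you want the values of the 'COHORT' column to be computed? Or should it be left blank? You are modifying my_apply_func correctly; just make sure it returns both the 'COHORT' column and the 'STATUS' column.

Perfect. That worked. I was able to add all the other columns I was interested in (there are actually 3 or more other variables I didn't mention). Your comments were very helpful for understanding how this works. Thanks!

Glad I could help. You're very welcome. :)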
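
The point made in the comments, that the helper has to return the extra column along with STATUS, can be sketched on a single group. This is an illustration under assumptions: the COHORT column and its values are hypothetical, and `fill_group` is a stand-in name rather than the answer's my_apply_func.

```python
import pandas as pd

# One person's records, with a hypothetical extra column COHORT
group = pd.DataFrame({
    "TERM": [20126, 20131, 20141],
    "STATUS": [1, 1, 1],
    "COHORT": [20126, 20126, 20126],
})


def fill_group(group, terms):
    """Add rows for missing terms and keep COHORT alongside STATUS."""
    group = group.set_index("TERM").reindex(terms)
    # carry the last known COHORT value into the newly added rows
    group["COHORT"] = group["COHORT"].ffill().astype(int)
    # added rows count as not enrolled
    group["STATUS"] = group["STATUS"].fillna(0).astype(int)
    # returning a DataFrame (not just group['STATUS']) keeps both columns
    return group[["COHORT", "STATUS"]]


filled = fill_group(group, [20126, 20131, 20136, 20141, 20146, 20151])
print(filled)
```

Returning a two-column DataFrame instead of a single Series is the only change needed; forward-filling COHORT without returning it has no visible effect on the result.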