Python: calculating the mean so far and a conditional difference

Tags: python, pandas, dataframe, diff, mean

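I have the following dataframe, in which a given assignment work_id is carried out by a student s_id on a date work_date, with a corresponding score. For each student, the dates are sorted in ascending order.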

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['work_id', 's_id', 'score', 'work_date'],
                  data=[['a3', 'p01', np.nan, '2020-05-01'],
                        ['a2', 'p01', 10, '2020-06-10'],
                        ['a1', 'p01', 5, '2020-06-15'],
                        ['a5', 'p02', 5, '2019-10-10'],
                        ['a7', 'p02', 11, '2020-03-01'],
                        ['a6', 'p02', np.nan, '2020-04-01'],
                        ['a4', 'p02', 4, '2020-06-20'],
                        ])

>>> df
  work_id s_id  score   work_date
0      a3  p01    NaN  2020-05-01
1      a2  p01   10.0  2020-06-10
2      a1  p01    5.0  2020-06-15
3      a5  p02    5.0  2019-10-10
4      a7  p02   11.0  2020-03-01
5      a6  p02    NaN  2020-04-01
6      a4  p02    4.0  2020-06-20
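I would like to add two columns: mean_score and diff_score. The mean_score column should show the mean score obtained by each student so far, that is, the mean of the current score and all scores obtained in previous assignments. The diff_score column should contain the difference between the current score and the previous score that is not NaN. The final dataframe should therefore look like this: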

  work_id s_id  score   work_date  mean_score  diff_score
0      a3  p01    NaN  2020-05-01         NaN         NaN
1      a2  p01   10.0  2020-06-10   10.000000         NaN
2      a1  p01    5.0  2020-06-15    7.500000        -5.0
3      a5  p02    5.0  2019-10-10    5.000000         NaN
4      a7  p02   11.0  2020-03-01    8.000000         6.0
5      a6  p02    NaN  2020-04-01         NaN         NaN
6      a4  p02    4.0  2020-06-20    6.666667        -7.0
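To make the two definitions concrete, here is a minimal sketch in plain Python that walks through student p02's scores by hand. The names scores_p02, seen and rows are only illustrative; NaN scores are skipped, so they produce NaN in both new columns and do not count as a "previous score".

```python
import math

# Scores of student p02 in date order, taken from the example dataframe.
scores_p02 = [5.0, 11.0, math.nan, 4.0]

rows = []  # one (score, mean_score, diff_score) tuple per assignment
seen = []  # non-NaN scores encountered so far

for score in scores_p02:
    if math.isnan(score):
        # A missing score gets NaN in both derived columns.
        rows.append((score, math.nan, math.nan))
        continue
    # Difference to the previous non-NaN score, if there is one.
    diff = score - seen[-1] if seen else math.nan
    seen.append(score)
    # Mean over every non-NaN score up to and including this one.
    rows.append((score, sum(seen) / len(seen), diff))

for row in rows:
    print(row)
```

For the last assignment this gives a mean of (5 + 11 + 4) / 3 and a difference of 4 - 11, which is what rows 3 to 6 of the table show.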
I can achieve this in a cumbersome way by defining the following two functions (which handle possible NaN entries) and using apply/lambda:

def calculate_mean(workid):
    date = df[df.work_id == workid].work_date.iloc[0]
    sid = df[df.work_id == workid].s_id.iloc[0]
    if df[(df.work_id==workid) & (df.s_id==sid) & (df.work_date == date)].score.notnull().item():
        mean = df[(df.s_id == sid) & (df.work_date <= date)].score.mean()
    else:
        mean = np.nan
    return mean

def calculate_diff(workid):
    date = df[df.work_id == workid].work_date.iloc[0]
    sid = df[df.work_id == workid].s_id.iloc[0]
    try:
        if df[(df.s_id==sid) & (df.work_date == date)].score.notnull().item():
            delta = df[(df.s_id == sid) & (df.work_date <= date) & (df.score.notnull())].score.diff().iloc[-1]
        else:
            delta = np.nan
    except Exception:  # e.g. .item() raises ValueError unless exactly one row matches
        delta = np.nan
    return delta

df['mean_score'] = df['work_id'].apply(calculate_mean)
df['diff_score'] = df['work_id'].apply(calculate_diff)

IIUC, use pandas.DataFrame.groupby with expanding().mean() and diff:

g = df.groupby("s_id")["score"]
s1 = g.apply(lambda x: x.dropna().expanding().mean())
s2 = g.apply(lambda x: x.dropna().diff())

df["mean_score"] = s1.reset_index(level=0, drop=True)
df["diff_score"] = s2.reset_index(level=0, drop=True)
print(df)
Or write a function:

def mean_and_diff(series):
    s = series.dropna()
    d = {"mean_score": s.expanding().mean(), "diff_score": s.diff()}
    return pd.DataFrame(d)
tmp = df.groupby("s_id")["score"].apply(mean_and_diff).reset_index(level=0, drop=True)
df[["mean_score", "diff_score"]] = tmp[["mean_score", "diff_score"]]
Output:

  work_id s_id  score   work_date  mean_score  diff_score
0      a3  p01    NaN  2020-05-01         NaN         NaN
1      a2  p01   10.0  2020-06-10   10.000000         NaN
2      a1  p01    5.0  2020-06-15    7.500000        -5.0
3      a5  p02    5.0  2019-10-10    5.000000         NaN
4      a7  p02   11.0  2020-03-01    8.000000         6.0
5      a6  p02    NaN  2020-04-01         NaN         NaN
6      a4  p02    4.0  2020-06-20    6.666667        -7.0
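
As a further variation, the same result can probably be obtained without apply by dropping the NaN rows first and relying on index alignment when assigning back. This is only a sketch: the names valid and running_mean are illustrative, and it assumes the rows are already in date order within each student, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["work_id", "s_id", "score", "work_date"],
    data=[
        ["a3", "p01", np.nan, "2020-05-01"],
        ["a2", "p01", 10, "2020-06-10"],
        ["a1", "p01", 5, "2020-06-15"],
        ["a5", "p02", 5, "2019-10-10"],
        ["a7", "p02", 11, "2020-03-01"],
        ["a6", "p02", np.nan, "2020-04-01"],
        ["a4", "p02", 4, "2020-06-20"],
    ],
)

# Assumption: rows are already sorted by date within each student.
# Keep only rows with a score, so NaN never counts as a "previous" score.
valid = df.dropna(subset=["score"])
g = valid.groupby("s_id")["score"]

# The expanding mean comes back indexed by (s_id, original row label);
# dropping the s_id level restores the original row labels.
running_mean = g.expanding().mean().reset_index(level=0, drop=True)

# Assignment aligns on the index, so rows that were dropped receive NaN.
df["mean_score"] = running_mean
df["diff_score"] = g.diff()

print(df)
```

Because the NaN rows are removed before grouping, they end up as NaN in both new columns purely through index alignment, with no explicit NaN handling needed.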