Python: How can I merge duplicate rows of a DataFrame that hold different information or NaN into a single row?
I have a DataFrame like the one below:
Name Email Assessment 1 Assessment 2 Assessment 3 Assessment 4 Assessment 5
0 abc abc@email.com Good NaN NaN NaN NaN
1 abc abc@email.com NaN Good Good NaN NaN
2 abc abc@email.com NaN NaN NaN Good NaN
3 abc abc@email.com NaN NaN NaN NaN Good
4 john john@email.com Good Good Fail NaN NaN
5 john john@email.com NaN NaN Good NaN NaN
6 john john@email.com NaN NaN NaN Good Good
7 joe joe@email.com Good Good Fail Fail NaN
8 joe joe@email.com NaN NaN Fail Good Good
9 joe joe@email.com NaN NaN Fail NaN NaN
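Here I am trying to merge the duplicate records, using Email as the key, and to keep the non-missing information from the later rows as the final value. For the example above, this is my expected output: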
Name Email Assessment 1 Assessment 2 Assessment 3 Assessment 4 Assessment 5
0 abc abc@email.com Good Good Good Good Good
1 john john@email.com Good Good Good Good Good
2 joe joe@email.com Good Good Fail Good Good
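I have seen many solutions here for combining rows, but most of them concatenate the contents, i.e. they produce a single value per email such as Good Good or Good Good Fail, rather than the result shown in the sample output. Please help.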
Sample data
import numpy as np
import pandas as pd

# Note: despite its name, data_dict is already a DataFrame.
data_dict = pd.DataFrame({'Name': ['abc','abc','abc','abc','john','john','john','joe','joe','joe'],
    'Email': ['abc@email.com','abc@email.com','abc@email.com','abc@email.com','john@email.com','john@email.com','john@email.com','joe@email.com','joe@email.com','joe@email.com'],
    'Assessment 1': ['Good', np.nan, np.nan, np.nan, 'Good', np.nan, np.nan, 'Good', np.nan, np.nan],
    'Assessment 2': [np.nan,'Good',np.nan,np.nan,'Good',np.nan,np.nan,'Good',np.nan,np.nan],
    'Assessment 3': [np.nan,'Good',np.nan,np.nan,'Fail','Good',np.nan,'Fail','Fail','Fail'],
    'Assessment 4': [np.nan,np.nan,'Good',np.nan,np.nan,np.nan,'Good','Fail','Good',np.nan],
    'Assessment 5': [np.nan,np.nan,np.nan,'Good',np.nan,np.nan,'Good','Fail','Good',np.nan]})
If you need, for each group, the last of the sorted unique non-missing values, use:
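df = pd.DataFrame(data_dict)

def f(x):
    # Last of the sorted unique non-missing values in the group.
    # An all-missing group leaves an empty array, so [-1] raises IndexError.
    try:
        return np.sort(x.dropna().unique())[-1]
    except IndexError:
        return np.nan

df = df.groupby(['Name','Email'], as_index=False, sort=False).agg(f)
print(df)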
Name Email Assessment 1 Assessment 2 Assessment 3 Assessment 4 \
0 abc abc@email.com Good Good Good Good
1 john john@email.com Good Good Good Good
2 joe joe@email.com Good Good Fail Good
Assessment 5
0 Good
1 Good
2 Good
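The following is a minimal sketch of what f returns for a single column, assuming string values as in the sample data. The three Series are made up for illustration. It suggests that the result depends on alphabetical order, not on row order, because 'Good' sorts after 'Fail'.

```python
import numpy as np
import pandas as pd

def f(x):
    # Last of the sorted unique non-missing values, or NaN if there are none.
    try:
        return np.sort(x.dropna().unique())[-1]
    except IndexError:
        return np.nan

# Hypothetical columns for one group each (not taken from the question).
fail_then_good = pd.Series(['Fail', np.nan, 'Good'], dtype=object)
good_then_fail = pd.Series(['Good', np.nan, 'Fail'], dtype=object)
all_missing = pd.Series([np.nan, np.nan], dtype=object)

print(f(fail_then_good))  # Good
print(f(good_then_fail))  # Good, even though Fail is in the later row
print(f(all_missing))     # nan
```

So this variant matches the expected output only when the alphabetically last value is also the one that should win.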
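What is the logic if the values in a group are Fail, Good or Good, Fail?
In every case the result from the latest row should win: for Fail-Good, keep Good because it is in the later row; by the same logic, for Good-Fail, Fail is the value in the combined row.
OK, I have edited the answer.
Edit:
If you need the last non-missing value per group, use: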
# GroupBy.last() returns, per column, the last non-missing value in each group.
df = df.groupby(['Name','Email'], as_index=False, sort=False).last()
print(df)
Name Email Assessment 1 Assessment 2 Assessment 3 Assessment 4 \
0 abc abc@email.com Good Good Good Good
1 john john@email.com Good Good Good Good
2 joe joe@email.com Good Good Fail Good
Assessment 5
0 Good
1 Good
2 Good
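Since the question asks for Email to be the key, here is a small sketch of the same idea grouped on Email alone. The frame small is a made-up miniature of the question's data, not the original sample. It is meant to show that last() skips missing values and follows row order, so a later Fail overrides an earlier Good.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: two people, two assessments.
small = pd.DataFrame({
    'Name': ['joe', 'joe', 'joe', 'abc', 'abc'],
    'Email': ['joe@email.com'] * 3 + ['abc@email.com'] * 2,
    'Assessment 1': ['Good', 'Fail', np.nan, np.nan, 'Good'],
    'Assessment 2': ['Fail', 'Good', np.nan, np.nan, np.nan],
})

# Group on Email only; sort=False keeps groups in order of first appearance.
merged = small.groupby('Email', as_index=False, sort=False).last()

# The grouping key moves to the front, so restore the original column order.
merged = merged[small.columns]
print(merged)
```

For joe, Assessment 1 becomes Fail (the later non-missing row) and Assessment 2 becomes Good. For abc, Assessment 2 stays missing because the group has no value at all. Grouping on Email alone assumes that Name is consistent within each email; if it is not, the last non-missing Name is kept.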