Python/Pandas: How to merge duplicate rows that have NaN in different columns?
There must be a better way, please help me.

Below is an extract of some data I have to clean. It contains several kinds of "duplicate" rows (not all rows are duplicated):

df =

So I have the following kinds of duplicate cases:

NaN and a valid value in the CreditScore column (LoanID = 100)
NaN and a valid value in the AnnualIncome column (LoanID = 200)
NaN and a valid value in the CreditScore column, and NaN and a valid value in the AnnualIncome column (LoanID = 300)
LoanID 400 and 500 are "normal" cases

Obviously, what I want is a dataframe without the duplicates, like:

LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
-------+------------+------------+-------------+--------------+-----
100 | ABC | Paid | 724 | 34200 |
200 | DEF | Write Off | 611 | 9800 |
300 | GHI | Paid | 799 | 247112 |
400 | JKL | Paid | NaN | NaN |
500 | MNO | Paid | 444 | NaN |

So, this is how I solved it:

# Get the repeated keys:
rep = df['LoanID'].value_counts()
rep = rep[rep > 1]  # a LoanID is repeated when it appears more than once
# Now we get the valid number (we overwrite the NaNs)
for i in rep.keys():
    df.loc[df['LoanID'] == i, 'CreditScore'] = df[df['LoanID'] == i]['CreditScore'].max()
    df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()
# Drop duplicates
df.drop_duplicates(inplace=True)
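
This works and does exactly what I need. The problem is that this dataframe has several hundred thousand records, so this method takes "forever". There must be a better way, right?

Grouping by LoanID, filling the missing values from the rows above and below within each group, and then dropping the duplicates seems to work:
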
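# Within each LoanID group: fill NaNs from the row above, then from the row below,
# and drop the rows that have become identical.
# ffill()/bfill() are equivalent to fillna(method='ffill') / fillna(method='bfill').
df.groupby('LoanID').apply(lambda x:
        x.ffill().
        bfill().
        drop_duplicates()).\
    reset_index(drop=True).\
    set_index('LoanID')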
# CustomerID LoanStatus CreditScore AnnualIncome
#LoanID
#100 ABC Paid 724.0 34200.0
#200 DEF Write Off 611.0 9800.0
#300 GHI Paid 799.0 247112.0
#400 JKL Paid NaN NaN
#500 MNO Paid 444.0 NaN
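
Since the complaint is speed, it may be worth trying an approach that avoids `apply` altogether. The sketch below is not from the original answer: it relies on `GroupBy.first()`, which by default returns the first non-null value of each column within each group. The sample frame is hypothetical, built only to reproduce the three duplicate cases described in the question, because the original extract is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (an assumption, not the asker's real extract):
# LoanID 100, 200 and 300 are duplicated with NaNs in different columns,
# LoanID 400 and 500 are the "normal" cases.
df = pd.DataFrame({
    "LoanID": [100, 100, 200, 200, 300, 300, 400, 500],
    "CustomerID": ["ABC", "ABC", "DEF", "DEF", "GHI", "GHI", "JKL", "MNO"],
    "LoanStatus": ["Paid", "Paid", "Write Off", "Write Off",
                   "Paid", "Paid", "Paid", "Paid"],
    "CreditScore": [724, np.nan, 611, 611, 799, np.nan, np.nan, 444],
    "AnnualIncome": [34200, 34200, 9800, np.nan, np.nan, 247112, np.nan, np.nan],
})

# One row per LoanID; each column takes the first non-null value in the group.
# A column that is NaN in every row of a group stays NaN.
merged = df.groupby("LoanID", as_index=False).first()
print(merged)
```

One design caveat: if two rows of the same LoanID hold different non-null values in a column, `first()` silently keeps the earlier one, whereas the `drop_duplicates()` approach above would keep both rows. Whether that matters depends on the data.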