
Python/Pandas: How do I merge duplicate rows that have NaN in different columns?


There must be a better way. Please help me out.

Here is an excerpt of some data I have to clean up, which contains several kinds of "duplicate" rows (not all rows are duplicated):

df=
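For illustration, an input consistent with the duplicate cases listed below can be built as follows; the concrete rows are hypothetical, inferred from those cases and from the desired output further down:

    import numpy as np
    import pandas as pd

    # Hypothetical input, inferred from the duplicate cases described below;
    # the real data may differ:
    df = pd.DataFrame({
        'LoanID':       [100, 100, 200, 200, 300, 300, 400, 500],
        'CustomerID':   ['ABC', 'ABC', 'DEF', 'DEF', 'GHI', 'GHI', 'JKL', 'MNO'],
        'LoanStatus':   ['Paid', 'Paid', 'Write Off', 'Write Off',
                         'Paid', 'Paid', 'Paid', 'Paid'],
        'CreditScore':  [np.nan, 724, 611, 611, 799, np.nan, np.nan, 444],
        'AnnualIncome': [34200, 34200, np.nan, 9800, np.nan, 247112, np.nan, np.nan],
    })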

So I have the following kinds of duplicate cases:

  • NaN and a valid value in column CreditScore (LoanID = 100)
  • NaN and a valid value in column AnnualIncome (LoanID = 200)
  • NaN and a valid value in column CreditScore, plus NaN and a valid value in column AnnualIncome (LoanID = 300)
  • LoanID 400 and 500 are "normal" cases

Obviously, what I want is a dataframe without the duplicates, like this:

    LoanID | CustomerID | LoanStatus | CreditScore | AnnualIncome | ...
    -------+------------+------------+-------------+--------------+-----
       100 | ABC        | Paid       |         724 |        34200 |
       200 | DEF        | Write Off  |         611 |         9800 |
       300 | GHI        | Paid       |         799 |       247112 |
       400 | JKL        | Paid       |         NaN |          NaN |
       500 | MNO        | Paid       |         444 |          NaN |
    
So, this is how I solved it:

    # Get the LoanIDs that appear more than once:
    rep = df['LoanID'].value_counts()
    rep = rep[rep > 1]

    # For each repeated LoanID, overwrite the NaNs with the valid value
    # (max() skips NaN, so it returns the non-null entry):
    for i in rep.keys():
        df.loc[df['LoanID'] == i, 'CreditScore']  = df[df['LoanID'] == i]['CreditScore'].max()
        df.loc[df['LoanID'] == i, 'AnnualIncome'] = df[df['LoanID'] == i]['AnnualIncome'].max()

    # The rows of each repeated LoanID are now identical, so drop the duplicates:
    df.drop_duplicates(inplace=True)
    

This works and does exactly what I need. The problem is that the dataframe has several hundred thousand records, so this approach takes "forever". There must be a better way, right?

Grouping by LoanID, filling the missing values forward and backward within each group, and then dropping the duplicates seems to do the trick:

    df.groupby('LoanID').apply(lambda x: \
                                 x.fillna(method='ffill').\
                                   fillna(method='bfill').\
                                   drop_duplicates()).\
                         reset_index(drop=True).\
                         set_index('LoanID')
    #       CustomerID LoanStatus  CreditScore  AnnualIncome  
    #LoanID                                                             
    #100           ABC       Paid        724.0       34200.0       
    #200           DEF  Write Off        611.0        9800.0       
    #300           GHI       Paid        799.0      247112.0       
    #400           JKL       Paid          NaN           NaN       
    #500           MNO       Paid        444.0           NaN       
    
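For frames with hundreds of thousands of rows, a vectorized alternative worth trying (a sketch, not the answer above) is GroupBy.first(), which takes the first non-null entry of each column within every group and therefore merges the NaN/value pairs in one pass, with no Python-level lambda:

    # Sketch, assuming each LoanID has at most one non-null value per column
    # (as in all the cases above): first() picks the first non-null entry of
    # every column within each group.
    merged = df.groupby('LoanID', as_index=False).first()

Because it avoids calling a Python function once per group, this is usually much faster than groupby().apply() on large inputs.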