Python: replace the values of rows that share the same ID with the max date

Here is a script that builds a simplified version of my df:

import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '3', '3', '4', '4', '5', '6', '7'],
    # '-' marks a missing expiry date
    'product1_expiry_date': ['-', '-', '2020-11-28', '2020-11-13', '-',
                             '2020-11-13', '2020-12-13', '-', '2020-11-16', '-',
                             '2020-11-28'],
    'product2_expiry_date': ['2020-11-16', '2020-11-19', '-',
                             '-', '2020-11-23', '2020-11-13',
                             '2020-12-13', '-', '2020-12-01', '2020-12-01',
                             '2020-12-14'],
})
print(df)

id  product1_expiry_date    product2_expiry_date
1            -                   2020-11-16
1            -                   2020-11-19
2        2020-11-28                  -
2        2020-11-13                  -
3            -                   2020-11-23
3        2020-11-13              2020-11-13
4        2020-12-13              2020-12-13
4            -                         -
5        2020-11-16              2020-12-01
6            -                   2020-12-01
7        2020-11-28              2020-12-14

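I want no duplicate IDs: for each ID, drop the earlier dates and the "-" values where applicable, because I am only interested in the latest date.

Expected DF: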

   id   product1_expiry_date    product2_expiry_date
    1            -                  2020-11-19
    2        2020-11-28                 -
    3        2020-11-13             2020-11-23
    4        2020-11-13             2020-11-13
    5        2020-12-13             2020-12-13
    6        2020-11-16             2020-12-01
    7        2020-11-28             2020-12-14

Thank you very much for your help.

Convert id to the index, then convert all columns to datetimes and take the max for each index value:

f = lambda x: pd.to_datetime(x, errors='coerce')
# DataFrame.max(level=0) was removed in pandas 2.0; group by the index level instead
df1 = df.set_index('id').apply(f).groupby(level=0).max()
print(df1)
   product1_expiry_date product2_expiry_date
id                                          
1                   NaT           2020-11-19
2            2020-11-28                  NaT
3            2020-11-13           2020-11-23
4            2020-12-13           2020-12-13
5            2020-11-16           2020-12-01
6                   NaT           2020-12-01
7            2020-11-28           2020-12-14
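
This works because errors='coerce' turns any value that cannot be parsed as a date, such as '-', into NaT, and max skips NaT unless every value in the group is missing. A minimal sketch on made-up Series (the names s, parsed and all_missing are illustrative and not part of the question's data):

```python
import pandas as pd

# One unparseable placeholder and two real dates
s = pd.Series(['-', '2020-11-28', '2020-11-13'])
parsed = pd.to_datetime(s, errors='coerce')

print(parsed.isna().tolist())  # [True, False, False]
print(parsed.max())            # 2020-11-28 00:00:00

# When every value is a placeholder, the max itself is NaT
all_missing = pd.to_datetime(pd.Series(['-', '-']), errors='coerce')
print(all_missing.max())       # NaT
```

Recent pandas versions may emit a warning that the date format could not be inferred when the first value is '-'; the parsed result is the same.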

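Replacing NaT with - is possible, but it mixes datetimes with strings in the same column, so any further processing of the dates is likely to be a problem: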

f = lambda x: pd.to_datetime(x, errors='coerce')
# DataFrame.max(level=0) was removed in pandas 2.0; group by the index level instead
df1 = df.set_index('id').apply(f).groupby(level=0).max().fillna('-')
print(df1)
   product1_expiry_date product2_expiry_date
id                                          
1                     -  2020-11-19 00:00:00
2   2020-11-28 00:00:00                    -
3   2020-11-13 00:00:00  2020-11-23 00:00:00
4   2020-12-13 00:00:00  2020-12-13 00:00:00
5   2020-11-16 00:00:00  2020-12-01 00:00:00
6                     -  2020-12-01 00:00:00
7   2020-11-28 00:00:00  2020-12-14 00:00:00
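
If the goal is only a display table in the question's original format, one option is to format the dates as text before filling, so that each column holds strings only. This is a sketch under that assumption, using a shortened copy of the question's data; the names latest and as_text are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2'],
    'product1_expiry_date': ['-', '-', '2020-11-28', '2020-11-13'],
    'product2_expiry_date': ['2020-11-16', '2020-11-19', '-', '-'],
})

# Latest date per id; '-' is coerced to NaT and ignored by max
latest = (
    df.set_index('id')
      .apply(lambda col: pd.to_datetime(col, errors='coerce'))
      .groupby(level=0)
      .max()
)

# Format every column as YYYY-MM-DD text; NaT becomes a missing value,
# which fillna then replaces with the '-' placeholder
as_text = latest.apply(lambda col: col.dt.strftime('%Y-%m-%d')).fillna('-')
print(as_text)
```

The result is text only, so it suits output rather than further date arithmetic; keep the datetime version (latest) for any later calculations.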

Finally, if necessary, turn the index back into an id column:

df1 = df1.reset_index()
