Python: replace the values of rows sharing the same ID with the max date
Here is a script that builds a simplified version of my df:
import pandas as pd
df = pd.DataFrame({
'id': ['1', '1','2','2','3','3','4','4','5','6','7'],
'product1_expiry_date' : ['-','-','2020-11-28','2020-11-13','-',
'2020-11-13','2020-12-13','-','2020-11-16','-',
'2020-11-28'],
'product2_expiry_date' : ['2020-11-16','2020-11-19','-',
'-','2020-11-23','2020-11-13',
'2020-12-13','-','2020-12-01','2020-12-01',
'2020-12-14']
})
df
id product1_expiry_date product2_expiry_date
1 - 2020-11-16
1 - 2020-11-19
2 2020-11-28 -
2 2020-11-13 -
3 - 2020-11-23
3 2020-11-13 2020-11-13
4 2020-12-13 2020-12-13
4 - -
5 2020-11-16 2020-12-01
6 - 2020-12-01
7 2020-11-28 2020-12-14
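I want a result with no duplicate IDs: for each ID, drop the earlier dates and the "-" values where applicable, because I am only interested in the latest date.
Expected DF: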
id product1_expiry_date product2_expiry_date
1 - 2020-11-19
2 2020-11-28 -
3 2020-11-13 2020-11-23
4 2020-12-13 2020-12-13
5 2020-11-16 2020-12-01
6 - 2020-12-01
7 2020-11-28 2020-12-14
Thank you very much for your help.

Convert id to the index, convert all columns to datetimes, then take the max for each index value:
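# '-' cannot be parsed as a date, so errors='coerce' turns it into NaT
f = lambda x: pd.to_datetime(x, errors='coerce')
# groupby(level=0).max() replaces max(level=0), which was removed in pandas 2.0
df1 = df.set_index('id').apply(f).groupby(level=0).max()
print(df1)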
product1_expiry_date product2_expiry_date
id
1 NaT 2020-11-19
2 2020-11-28 NaT
3 2020-11-13 2020-11-23
4 2020-12-13 2020-12-13
5 2020-11-16 2020-12-01
6 NaT 2020-12-01
7 2020-11-28 2020-12-14
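The reason this works is that errors='coerce' turns the unparseable "-" placeholder into NaT, and max skips missing values by default, so a group only stays NaT when it has no valid date at all. A minimal sketch on toy Series (the sample values and the names s, dates and only_missing are made up for illustration):

```python
import pandas as pd

# Toy values for illustration only: two real dates and one placeholder.
s = pd.Series(['2020-11-13', '-', '2020-11-28'])

# errors='coerce' converts anything unparseable ('-') to NaT instead of raising.
dates = pd.to_datetime(s, errors='coerce')
print(dates.isna().tolist())   # [False, True, False]

# max() skips NaT by default, so the latest real date wins.
print(dates.max())             # 2020-11-28 00:00:00

# A group made only of placeholders has no valid date and stays NaT.
only_missing = pd.to_datetime(pd.Series(['-', '-']), errors='coerce')
print(only_missing.max())      # NaT
```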
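Replacing NaT with - is also possible, but it mixes datetimes with strings in the same column, so any later processing is likely to be problematic: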
f = lambda x: pd.to_datetime(x, errors='coerce')
# groupby(level=0).max() replaces max(level=0), which was removed in pandas 2.0
df1 = df.set_index('id').apply(f).groupby(level=0).max().fillna('-')
print(df1)
product1_expiry_date product2_expiry_date
id
1 - 2020-11-19 00:00:00
2 2020-11-28 00:00:00 -
3 2020-11-13 00:00:00 2020-11-23 00:00:00
4 2020-12-13 00:00:00 2020-12-13 00:00:00
5 2020-11-16 00:00:00 2020-12-01 00:00:00
6 - 2020-12-01 00:00:00
7 2020-11-28 00:00:00 2020-12-14 00:00:00
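To make the mixing problem concrete, the sketch below (toy data, not the original frame; col and mixed are illustrative names) shows that filling NaT with "-" turns a datetime column into object dtype, after which datetime-only features such as the .dt accessor stop working:

```python
import pandas as pd

# Toy column for illustration: one real date and one missing value.
col = pd.to_datetime(pd.Series(['2020-11-19', None]))
print(pd.api.types.is_datetime64_any_dtype(col))   # True

# Filling NaT with a string forces the whole column to object dtype.
mixed = col.fillna('-')
print(mixed.dtype)                                  # object

# Datetime-only features such as the .dt accessor now fail.
try:
    mixed.dt.year
except AttributeError as exc:
    print('no .dt accessor:', exc)
```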
Finally, if needed, convert the index back into an id column:
df1 = df1.reset_index()
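If the goal is to reproduce the expected frame literally (plain date strings, with - where an ID has no date), one option is to keep real datetimes for the comparison and format them as text only at the very end. This is a sketch under that assumption, not part of the steps above; the names latest and out are made up here, and passing format='%Y-%m-%d' assumes every real value uses that layout:

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '3', '3', '4', '4', '5', '6', '7'],
    'product1_expiry_date': ['-', '-', '2020-11-28', '2020-11-13', '-',
                             '2020-11-13', '2020-12-13', '-', '2020-11-16', '-',
                             '2020-11-28'],
    'product2_expiry_date': ['2020-11-16', '2020-11-19', '-',
                             '-', '2020-11-23', '2020-11-13',
                             '2020-12-13', '-', '2020-12-01', '2020-12-01',
                             '2020-12-14'],
})

# Parse every column as dates ('-' becomes NaT), then keep the latest per id.
latest = (df.set_index('id')
            .apply(pd.to_datetime, format='%Y-%m-%d', errors='coerce')
            .groupby(level=0)
            .max())

# Format as text only now: NaT becomes a missing string, which is filled with '-'.
out = (latest.apply(lambda s: s.dt.strftime('%Y-%m-%d'))
             .fillna('-')
             .reset_index())
print(out)
```

Every column of out then holds only strings, so nothing downstream has to deal with a mix of Timestamp objects and "-"; latest is still available whenever real date arithmetic is needed.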