Is there a way in Python to keep the rows that meet a specific condition and drop the others?
Tags: python, pandas, dataframe, duplicates
I have a DataFrame (df) with the columns ID, start, end and Diff. The problem: when the end time is the same, I want to drop the row with the longer duration and keep the row with the shortest duration. Expected result:
ID start end Diff
A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
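Essentially, when the end times are the same, I want to drop the row with the longer duration.
I tried the approach below, but it does not handle this requirement:
when the end time is the same, keep the row with the shorter duration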
df.sort_values(['Diff']).drop_duplicates(subset=['ID'])
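Any suggestions are appreciated.
Use groupby on the end column with transform to get the minimum Diff of each group, compare that with df['Diff'], and keep the rows where the comparison is True. The output further down shows how transform returns the group minimum for every row of the group: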
df[df['Diff'].eq(df.groupby('end')['Diff'].transform('min'))]
Output of groupby + transform:
print(df.groupby('end')['Diff'].transform('min'))
0 5
1 5
2 4
3 4
4 5
5 20
6 5
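For anyone who wants to run the mask approach end to end, here is a minimal sketch. The original input frame is not shown in the thread, so the data is partly an assumption: rows 1, 3, 4, 5 and 6 are taken from the expected result, while rows 0 and 2 (with Diff 10) are invented only to create duplicate end times.

```python
import pandas as pd

# Hypothetical input: rows 1, 3, 4, 5, 6 come from the expected result in the
# question; rows 0 and 2 are invented here to produce duplicate end times.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": ["1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
              "1/9/2020 1:00:00 AM", "1/9/2020 1:00:06 AM",
              "1/9/2020 1:00:20 AM", "1/10/2020 5:00:05 AM",
              "1/10/2020 5:00:40 AM"],
    "end": ["1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
            "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM",
            "1/9/2020 1:00:25 AM", "1/10/2020 5:00:25 AM",
            "1/10/2020 5:00:45 AM"],
    "Diff": [10, 5, 10, 4, 5, 20, 5],
})

# Minimum Diff per end time, broadcast back to every row of its group.
group_min = df.groupby("end")["Diff"].transform("min")

# Keep only the rows whose Diff equals the minimum of their group.
result = df[df["Diff"].eq(group_min)]
print(result)
```

Two things to keep in mind with this version: if several rows share both the same end and the same minimum Diff, all of them are kept, and grouping on end alone treats equal end times from different IDs as one group. Grouping on ['ID', 'end'] instead would keep the IDs separate.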
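We can sort by start: for rows that share the same end, a later start means a shorter duration, so the shorter row naturally comes last. Then use drop_duplicates with keep='last':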
df.sort_values(['ID', 'start', 'end']).drop_duplicates(['ID', 'end'], keep='last')
ID start end Diff
1 A 2020-01-08 00:00:05 2020-01-08 00:00:10 5
3 B 2020-01-09 01:00:06 2020-01-09 01:00:10 4
4 B 2020-01-09 01:00:20 2020-01-09 01:00:25 5
5 C 2020-01-10 05:00:05 2020-01-10 05:00:25 20
6 C 2020-01-10 05:00:40 2020-01-10 05:00:45 5
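The output above shows start and end as real timestamps, so this approach presumably ran on parsed datetimes. Sorting the raw strings would be lexicographic and could order rows incorrectly. A minimal sketch under that assumption follows. The input rows 0 and 2 are invented for illustration (the original frame is not shown), and the format string is my guess at the layout of the timestamps in the question.

```python
import pandas as pd

# Hypothetical input: rows 0 and 2 are invented to create duplicate end times.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": ["1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
              "1/9/2020 1:00:00 AM", "1/9/2020 1:00:06 AM",
              "1/9/2020 1:00:20 AM", "1/10/2020 5:00:05 AM",
              "1/10/2020 5:00:40 AM"],
    "end": ["1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
            "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM",
            "1/9/2020 1:00:25 AM", "1/10/2020 5:00:25 AM",
            "1/10/2020 5:00:45 AM"],
    "Diff": [10, 5, 10, 4, 5, 20, 5],
})

# Parse the timestamps so that sorting is chronological, not lexicographic.
# The format string is an assumption based on the values shown in the question.
for col in ("start", "end"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y %I:%M:%S %p")

# Within each (ID, end) pair the latest start is the shortest duration,
# so after sorting by start we keep the last row of each pair.
result = (
    df.sort_values(["ID", "start", "end"])
      .drop_duplicates(["ID", "end"], keep="last")
)
print(result)
```

This variant never looks at Diff, so it only gives the right answer when Diff really is end minus start.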
Sort by ID, end and Diff, then take the first row of each (ID, end) group, which is the one with the shortest Diff:
>>> df.sort_values(['ID', 'end', 'Diff']).groupby(['ID', 'end'], sort=False).head(1)
ID start end Diff
1 A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
3 B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
4 B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
5 C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
6 C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
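Here is a runnable sketch of the sort-then-head(1) idea, with an idxmin variant as a cross-check. The idxmin line is my own addition rather than part of the answer, and rows 0 and 2 of the input are invented because the original frame is not shown.

```python
import pandas as pd

# Hypothetical input: rows 0 and 2 are invented to create duplicate end times.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": ["1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
              "1/9/2020 1:00:00 AM", "1/9/2020 1:00:06 AM",
              "1/9/2020 1:00:20 AM", "1/10/2020 5:00:05 AM",
              "1/10/2020 5:00:40 AM"],
    "end": ["1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
            "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM",
            "1/9/2020 1:00:25 AM", "1/10/2020 5:00:25 AM",
            "1/10/2020 5:00:45 AM"],
    "Diff": [10, 5, 10, 4, 5, 20, 5],
})

# Sort so the smallest Diff comes first inside each (ID, end) group,
# then keep the first row of every group.
result = (
    df.sort_values(["ID", "end", "Diff"])
      .groupby(["ID", "end"], sort=False)
      .head(1)
)

# Cross-check (an alternative, not from the answer): idxmin returns the index
# label of the smallest Diff in each group, which can be fed to .loc.
alt = df.loc[df.groupby(["ID", "end"])["Diff"].idxmin()].sort_index()

print(result)
print(alt)
```

Unlike the boolean-mask approach, both variants keep exactly one row per (ID, end) group even when two rows tie on the minimum Diff.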
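OK, I'll give it a try. Can you explain what the code is doing? I'm still learning more about Pandas.
@TanishaHudson Just updated the answer with an explanation :) Let me know if you need anything.
This works perfectly. I have to wait 10 minutes before I can vote. Thank you very much.
Also very helpful. Thank you very much.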
print(df['Diff'].eq(df.groupby('end')['Diff'].transform('min')))
0 False
1 True
2 False
3 True
4 True
5 True
6 True
df[df['Diff'].eq(df['end'].map(df.groupby('end')['Diff'].min()))]
ID start end Diff
1 A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
3 B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
4 B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
5 C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
6 C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
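The map version above should build the same mask as the transform version: groupby(...).min() gives one minimum per end value, and map looks that value up for every row. A small sketch to check this follows. The two-column frame is a hypothetical stand-in, with Diff values for rows 0 and 2 invented.

```python
import pandas as pd

# Hypothetical stand-in with only the two columns the mask needs;
# the Diff values of rows 0 and 2 are invented for illustration.
df = pd.DataFrame({
    "end": ["1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
            "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM",
            "1/9/2020 1:00:25 AM", "1/10/2020 5:00:25 AM",
            "1/10/2020 5:00:45 AM"],
    "Diff": [10, 5, 10, 4, 5, 20, 5],
})

# One minimum per distinct end value (a Series indexed by end).
per_end_min = df.groupby("end")["Diff"].min()

# map: look up each row's group minimum by its end value.
mask_map = df["Diff"].eq(df["end"].map(per_end_min))

# transform: compute the group minimum already aligned to the rows.
mask_transform = df["Diff"].eq(df.groupby("end")["Diff"].transform("min"))

print(mask_map.tolist())
print(mask_map.tolist() == mask_transform.tolist())
```

The transform form is shorter. The map form may be handy when the per-group minimum is already available as a separate Series.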