Python: identifying duplicates within a rolling time window in a DataFrame
I have a DataFrame, and I want to identify (and eventually delete) rows that are duplicated within a sliding time window.
import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in
data = {
    'type': ['apple', 'apple', 'apple', 'berry', 'grape', 'apple'],
    'attr': ['red', 'green', 'red', 'blue', 'green', 'red'],
    'timestamp': [
        '2021-03-01 12:00:00',
        '2021-03-01 12:00:30',
        '2021-03-01 12:01:13',
        '2021-03-01 12:01:30',
        '2021-03-01 12:10:00',
        '2021-03-01 12:11:00',
    ],
}
df = pd.DataFrame(data)
df['is_dup'] = False
print(df)
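In this example, my goal is to mark a row as a duplicate when its 'type' and 'attr' equal those of another row that occurred within 2 minutes. So I want to flag index 2 as is_dup=True, because it matches index 0 and falls inside the 2-minute window, but not row 5, because its timestamp is outside the window.
The resulting DataFrame would therefore look like this: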
    type   attr            timestamp  is_dup
0  apple    red  2021-03-01 12:00:00   False
1  apple  green  2021-03-01 12:00:30   False
2  apple    red  2021-03-01 12:01:13    True
3  berry   blue  2021-03-01 12:01:30   False
4  grape  green  2021-03-01 12:10:00   False
5  apple    red  2021-03-01 12:11:00   False
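Thanks in advance.
I create a temporary column, diff, that holds the time difference between consecutive rows within each (type, attr) group. I then check separately whether that difference is no more than 2 minutes and, if so, set is_dup to True.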
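# diff() needs real datetimes, so convert the string timestamps first
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Time since the previous row with the same (type, attr); 0 for the first row of each group
df['diff'] = df.groupby(['type', 'attr'])['timestamp'].diff().fillna(pd.Timedelta(seconds=0))
# Flag rows that follow a matching row by more than 0 and at most 2 minutes
df.loc[(df['diff']>pd.Timedelta(0,'m')) & (df['diff']<=pd.Timedelta(2,'m')), 'is_dup'] = True
df=df.drop(['diff'], axis=1)
print(df)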
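For reference, the same idea can be packaged as a self-contained function. This is only a sketch: the name flag_window_duplicates and its parameters keys, ts_col and window are hypothetical, and it assumes the index labels are unique. It also sorts by timestamp first, since diff() compares each row with the one immediately before it in its group.

```python
import pandas as pd

def flag_window_duplicates(df, keys, ts_col="timestamp", window=pd.Timedelta(minutes=2)):
    """Hypothetical helper: True where a row follows the previous row with
    the same `keys` by more than 0 and at most `window`."""
    # Work on a time-sorted copy so diff() always looks at the earlier row
    ordered = df.assign(**{ts_col: pd.to_datetime(df[ts_col])}).sort_values(ts_col, kind="stable")
    # Gap to the previous row in the same group; 0 for the first row of each group
    gap = ordered.groupby(keys)[ts_col].diff().fillna(pd.Timedelta(0))
    flags = (gap > pd.Timedelta(0)) & (gap <= window)
    # Restore the caller's row order (assumes unique index labels)
    return flags.reindex(df.index)

df = pd.DataFrame({
    "type": ["apple", "apple", "apple", "berry", "grape", "apple"],
    "attr": ["red", "green", "red", "blue", "green", "red"],
    "timestamp": [
        "2021-03-01 12:00:00", "2021-03-01 12:00:30", "2021-03-01 12:01:13",
        "2021-03-01 12:01:30", "2021-03-01 12:10:00", "2021-03-01 12:11:00",
    ],
})
df["is_dup"] = flag_window_duplicates(df, ["type", "attr"])
print(df)
```

Because each row is compared only with its immediate predecessor in the group, a run of closely spaced repeats is flagged link by link rather than against the first occurrence. Whether that is the desired meaning of "within 2 minutes" depends on the use case.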
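Does this answer your question?
Shouldn't index 0 also be is_dup=True?
I don't want the original to be treated as a dup. Later I'll go back and delete every row where is_dup=True, and I don't want the original row deleted in that case.
Wow, thank you so much!!
Hmm, what is the purpose of df['diff']>pd.Timedelta(0,'m')? Isn't diff always positive?
@tdy There are multiple rows with a zero diff value and df['diff']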
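One way to read the exchange above: fillna gives the first row of every group a diff of zero, so the "> 0" test is what keeps those originals from being flagged. A side effect is that a genuine repeat with exactly the same timestamp is not flagged either. The sketch below is a variant, not the answer's code, and uses made-up sample rows: it leaves the first-row gaps as NaT instead, relying on the fact that comparisons against NaT evaluate to False, and then drops the flagged rows.

```python
import pandas as pd

# Illustrative data: the second row repeats the first at the same instant
df = pd.DataFrame({
    "type": ["apple", "apple", "apple"],
    "attr": ["red", "red", "red"],
    "timestamp": pd.to_datetime([
        "2021-03-01 12:00:00",
        "2021-03-01 12:00:00",
        "2021-03-01 12:05:00",
    ]),
})

# Gap to the previous matching row; NaT for the first row of each group
gap = df.groupby(["type", "attr"])["timestamp"].diff()

# NaT <= Timedelta is False, so originals are never flagged and no
# "> 0" guard is needed; zero-gap repeats are flagged as duplicates
df["is_dup"] = gap <= pd.Timedelta(minutes=2)

# Remove the flagged rows while keeping each original
deduped = df[~df["is_dup"]].drop(columns="is_dup")
print(deduped)
```

Whether same-instant repeats should count as duplicates is a decision for the data at hand; if they should not, the original "> 0" condition can be kept.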