Python: identifying duplicates within a rolling time window in a DataFrame

I have a DataFrame in which I want to identify (and eventually remove) duplicate rows within a sliding time window.

import pandas as pd

# Named `data` rather than `dict` to avoid shadowing the built-in type
data = {
    'type': ['apple','apple','apple','berry','grape','apple'],
    'attr': ['red','green','red','blue','green','red'],
    'timestamp': [ '2021-03-01 12:00:00',
                  '2021-03-01 12:00:30',
                  '2021-03-01 12:01:13',
                  '2021-03-01 12:01:30',
                  '2021-03-01 12:10:00',
                  '2021-03-01 12:11:00',
                 ]
}
df = pd.DataFrame(data)
df['is_dup'] = False
print(df)
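In this example, my goal is to flag a row as a duplicate when its 'type' and 'attr' match those of another row that occurred within the previous 2 minutes. So I want index 2 marked is_dup=True, because it matches index 0 and falls inside the 2-minute window, but not index 5, because its timestamp is outside the window.

The resulting DataFrame should therefore look like this: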

    type   attr            timestamp  is_dup
0  apple    red  2021-03-01 12:00:00   False
1  apple  green  2021-03-01 12:00:30   False
2  apple    red  2021-03-01 12:01:13   True
3  berry   blue  2021-03-01 12:01:30   False
4  grape  green  2021-03-01 12:10:00   False
5  apple    red  2021-03-01 12:11:00   False

Thanks in advance.

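I create a temporary column, diff, that groups the rows by 'type' and 'attr' and stores the time difference from the previous row in the same group. I then check whether that difference is greater than zero and at most 2 minutes, and if so set is_dup to True.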

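# The timestamps are strings in the question, so convert them before taking differences
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Time since the previous row with the same (type, attr); 0 for the first row of each group
df['diff'] = df.groupby(['type', 'attr'])['timestamp'].diff().fillna(pd.Timedelta(seconds=0))
# Flag rows that follow a matching row by more than 0 and at most 2 minutes
df.loc[(df['diff']>pd.Timedelta(0,'m')) & (df['diff']<=pd.Timedelta(2,'m')), 'is_dup'] = True
df=df.drop(['diff'], axis=1)
print(df)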
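
To make the role of the temporary column concrete, here is a small sketch that rebuilds the question's data and prints the per-group gaps in seconds before any filtering. The names data, gap, and gap_seconds are illustrative and do not appear in the original answer.

```python
import pandas as pd

data = {
    'type': ['apple', 'apple', 'apple', 'berry', 'grape', 'apple'],
    'attr': ['red', 'green', 'red', 'blue', 'green', 'red'],
    'timestamp': pd.to_datetime([
        '2021-03-01 12:00:00', '2021-03-01 12:00:30', '2021-03-01 12:01:13',
        '2021-03-01 12:01:30', '2021-03-01 12:10:00', '2021-03-01 12:11:00',
    ]),
}
df = pd.DataFrame(data)

# Gap to the previous row with the same (type, attr). The first row of each
# group has no predecessor, so diff() yields NaT there and fillna makes it 0.
gap = df.groupby(['type', 'attr'])['timestamp'].diff().fillna(pd.Timedelta(seconds=0))
gap_seconds = gap.dt.total_seconds().tolist()
print(gap_seconds)
```

Only index 2 (73 seconds after index 0) lands between 0 and 120 seconds. Index 5 is 587 seconds after index 2, and every first occurrence has a gap of 0 after the fill, which is what the greater-than-zero check filters out.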

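Does this answer your question?
Shouldn't index 0 also be is_dup=True?
I don't want the original to be treated as a dup. Later I'll go back and remove every row where is_dup=True, and in that case I don't want the original row removed.
Wow, thanks so much!!
Hmm, what is the purpose of df['diff']>pd.Timedelta(0,'m')? Isn't diff always positive?
@tdy Several rows have a diff value of zero and df['diff']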
Output of the answer's code:
    type   attr           timestamp  is_dup
0  apple    red 2021-03-01 12:00:00   False
1  apple  green 2021-03-01 12:00:30   False
2  apple    red 2021-03-01 12:01:13    True
3  berry   blue 2021-03-01 12:01:30   False
4  grape  green 2021-03-01 12:10:00   False
5  apple    red 2021-03-01 12:11:00   False
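
The question also mentions eventually removing the flagged rows, and the comments ask about the greater-than-zero check. The sketch below is one possible variation, not the answerer's code: it assumes the rows should be sorted by timestamp first, skips fillna so that the first row of each group stays NaT (which compares as False), and then drops the flagged rows. Note one behavioral difference: rows with an identical timestamp to their predecessor (a gap of exactly 0) would be flagged here, whereas the answer's condition leaves them unflagged. The name deduped is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'type': ['apple', 'apple', 'apple', 'berry', 'grape', 'apple'],
    'attr': ['red', 'green', 'red', 'blue', 'green', 'red'],
    'timestamp': pd.to_datetime([
        '2021-03-01 12:00:00', '2021-03-01 12:00:30', '2021-03-01 12:01:13',
        '2021-03-01 12:01:30', '2021-03-01 12:10:00', '2021-03-01 12:11:00',
    ]),
})

# Sort so diff() compares each row with its nearest earlier match.
df = df.sort_values('timestamp')
gap = df.groupby(['type', 'attr'])['timestamp'].diff()

# NaT <= Timedelta evaluates to False, so first occurrences are never flagged.
df['is_dup'] = gap <= pd.Timedelta(minutes=2)

# Keep only the rows that were not flagged, then discard the helper column.
deduped = df[~df['is_dup']].drop(columns='is_dup')
print(deduped)
```

With the question's data this flags only index 2, so deduped keeps indices 0, 1, 3, 4, and 5.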