Pandas: drop nearly duplicate rows based on timestamp


I am trying to remove some nearly duplicate data. I am looking for a way to detect a user's most recent (edited_at) trip without losing information.

My idea is to solve this by computing the difference between consecutive timestamps and removing the group with the smallest difference (zero in this example; see Step 1).

I am open to other suggestions.

Note: do not use the `drop_duplicates()` function.

DataFrame:

data = [[111, 121, "2019-10-22 05:00:00", 0],
        [111, 121, "2019-10-22 05:00:00", 1],
        [111, 123, "2019-10-22 07:10:00", 0], 
        [111, 123, "2019-10-22 07:10:00", 1], 
        [111, 123, "2019-10-22 07:10:00", 2],
        [111, 124, "2019-10-22 07:20:00", 0],
        [111, 124, "2019-10-22 07:20:00", 1],
        [111, 124, "2019-10-22 07:20:00", 2],
        [111, 124, "2019-10-22 07:20:00", 3],
        [111, 125, "2019-10-22 19:20:00", 0], 
        [111, 125, "2019-10-22 19:20:00", 1],
        [222, 223, "2019-11-24 06:00:00", 0], 
        [222, 223, "2019-11-24 06:00:00", 1], 
        [222, 244, "2019-11-24 06:15:00", 0],
        [222, 244, "2019-11-24 06:15:00", 1],
        [222, 255, "2019-11-24 18:15:10", 0],
        [222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"]) 

df['edited_at'] = pd.to_datetime(df['edited_at'])
Step 1:

user_id, prompt_uuid, edited_at, prompt_num, time_diff (minutes)
111, 121, "2019-10-22 05:00:00", 0, something,
111, 121, "2019-10-22 05:00:00", 1, something,
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0,   0,
222, 223, "2019-11-24 06:00:00", 1,   0,
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Step 2:

user_id, prompt_uuid, edited_at, prompt_num, time_diff (minutes)
111, 121, "2019-10-22 05:00:00", 0, something,
111, 121, "2019-10-22 05:00:00", 1, something,
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
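The two steps above can be sketched as follows. This is a minimal, self-contained sketch, not the asker's code: the gap is computed on one row per `(user_id, prompt_uuid)` group and broadcast back, and it assumes `prompt_uuid` is unique across users (true of the sample data). The data is abbreviated.

```python
import pandas as pd

data = [[111, 121, "2019-10-22 05:00:00", 0],
        [111, 121, "2019-10-22 05:00:00", 1],
        [111, 123, "2019-10-22 07:10:00", 0],
        [111, 124, "2019-10-22 07:20:00", 0],
        [111, 125, "2019-10-22 19:20:00", 0],
        [222, 223, "2019-11-24 06:00:00", 0],
        [222, 244, "2019-11-24 06:15:00", 0],
        [222, 255, "2019-11-24 18:15:10", 0]]
df = pd.DataFrame(data, columns=["user_id", "prompt_uuid", "edited_at", "prompt_num"])
df["edited_at"] = pd.to_datetime(df["edited_at"])

# Step 1: one row per trip, then the gap to the same user's next trip
trips = df.groupby(["user_id", "prompt_uuid"], as_index=False)["edited_at"].first()
trips["time_diff"] = trips.groupby("user_id")["edited_at"].diff(-1).abs()

# broadcast the per-trip gap back onto every original row
df["time_diff"] = df["prompt_uuid"].map(trips.set_index("prompt_uuid")["time_diff"])

# Step 2: drop all rows of the trip with the smallest gap per user
to_drop = trips.loc[trips.groupby("user_id")["time_diff"].idxmin(), "prompt_uuid"]
result = df[~df["prompt_uuid"].isin(to_drop)]
```

On this sample, trips 123 and 223 have the smallest gap for their users and are removed, matching the Step 2 output above.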

I may not have understood all the requirements, but I have inferred them from the example output I expected to see. Split to get the status from the 'resp' column. Use `groupby().first()` to get the first row for each split status. Then fix the column names and column order:

df1 = pd.concat([df, df['resp'].str.split(',', expand=True)], axis=1).drop('resp', axis=1)
df1 = df1.groupby(1, as_index=False).first().sort_values('edited_at', ascending=True)
df1 = df1.drop(0, axis=1)
df1.columns = ['resp', 'prompt_uuid', 'displayed_at', 'edited_at', 'latitude', 'longitude', 'prompt_num', 'uuid']
df1 = df1.iloc[:, [1, 0, 2, 3, 4, 5, 6, 7]]

df1
prompt_uuid resp    displayed_at    edited_at   latitude    longitude   prompt_num  uuid
1   ab123-9600-3ee130b2c1ff foot    2019-10-22 22:39:57 2019-10-22 23:15:07 44.618787   -72.616841  0   4248-b313-ef2206755488
2   ab123-9600-3ee130b2c1ff metro   2019-10-22 22:50:35 2019-10-22 23:15:07 44.617968   -72.615851  1   4248-b313-ef2206755488
4   ab123-9600-3ee130b2c1ff work    2019-10-22 22:59:20 2019-10-22 23:15:07 44.616902   -72.614793  2   4248-b313-ef2206755488
3   zw999-1555-8ee140b2w1aa shopping    2019-11-23 08:01:35 2019-10-23 08:38:07 44.617968   -72.615851  1   4248-b313-ef2206755488
0   zw999-1555-8ee140b2w1bb bike    2019-11-23 07:39:57 2019-10-23 08:45:24 44.618787   -72.616841  0   4248-b313-ef2206755488
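The answer above operates on a 'resp' column from the asker's updated data, which is not part of the frame reproduced in this question. A self-contained illustration of the same split-then-first pattern, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "resp": ["foot,metro", "foot,metro", "bike"],
    "edited_at": pd.to_datetime(["2019-10-22 23:15:07",
                                 "2019-10-22 23:15:07",
                                 "2019-10-23 08:45:24"]),
})

# expand the comma-separated responses into integer-named columns 0, 1, ...
parts = df["resp"].str.split(",", expand=True)
df1 = pd.concat([df, parts], axis=1).drop("resp", axis=1)

# keep the first row seen for each value of the first split column
first = df1.groupby(0, as_index=False).first()
```

Single-response rows simply get `None` in the extra split columns, which `groupby().first()` skips when picking the first non-null value per column.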


Because your DataFrame contains duplicates over `['user_id', 'prompt_uuid']`, a simple `diff` will not give you the time difference between consecutive groups. First `drop_duplicates`, then compute the time difference within each `user_id`. You can then filter this to find the minimum time difference per user:

s = df.drop_duplicates(['user_id', 'prompt_uuid']).copy()
s['time_diff'] = s.groupby('user_id')['edited_at'].diff(-1).abs()
s = s[s['time_diff'] == s.groupby('user_id')['time_diff'].transform('min')]

#    user_id  prompt_uuid           edited_at  prompt_num time_diff
#2       111          123 2019-10-22 07:10:00           0  00:10:00
#11      222          223 2019-11-24 06:00:00           0  00:15:00
Now, if you want to further subset this to rows where the time difference is within some small threshold (i.e. a user's minimum time difference could be 4 hours, and you might still want to keep that group…)
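That threshold filter might look like the following sketch; the 4-hour cutoff and the small frame of per-user minimum gaps are illustrative, not from the original answer:

```python
import pandas as pd

# per-user minimum gaps, as produced by the diff step above (values illustrative)
s = pd.DataFrame({
    "user_id": [111, 222],
    "prompt_uuid": [123, 223],
    "time_diff": pd.to_timedelta(["00:10:00", "04:00:00"]),
})

# only treat a gap as a near-duplicate when it is strictly under the threshold
threshold = pd.Timedelta(hours=4)
near_dupes = s[s["time_diff"] < threshold]
```

Here only the 10-minute gap survives the filter; the exact 4-hour gap does not, because the comparison is strict.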



Thanks for your reply, but this is not what I am looking for. Please check: I have added more explanation, and I apologize if I was unclear.

Neither case 1 nor case 2 is a clear criterion. Aren't the most recent date and the last data the same in chronological order? If so, you can get the answer via `groupby().last()`.

I have updated my question to make it clearer, with the expected trace.

If you arrive earlier at 05:00, why is 07:10 the zero?

@Scott Boston Sorry, that was an oversight; is it clear now? The goal is to eliminate nearly similar `prompt_uuid`s.

Now explain why you eliminate 07:10; what is the logic? In the second group you would eliminate the first group at 06:00.

It does not cover all my real cases, but it gave me an idea for solving the problem. Thank you.
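The `groupby().last()` idea raised in the comments can be sketched like this: sort chronologically, then keep each user's last trip (data abbreviated from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [111, 111, 222, 222],
    "prompt_uuid": [124, 125, 244, 255],
    "edited_at": pd.to_datetime(["2019-10-22 07:20:00",
                                 "2019-10-22 19:20:00",
                                 "2019-11-24 06:15:00",
                                 "2019-11-24 18:15:10"]),
})

# sort chronologically so last() really is the latest trip per user
latest = df.sort_values("edited_at").groupby("user_id").last()
```

As the comments note, this only answers the question if "most recent" and "last in chronological order" coincide; it does not implement the minimum-gap logic.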