Python: drop duplicates if timestamps are close

python pandas

I have a DataFrame containing "log" information about who is working, on which task, and the time he/she started working:

 index |    Entrance time   | Name | Last name | Employee_ID  | Task
 --------------------------------------------------------------------
   0   |2000-01-01 00:00:00 | John |  Fischer  |    001       | Maintenance
   1   |2000-01-01 00:04:30 | John |  Fischer  |    001       | Development
   2   |2000-01-01 00:04:30 | Bob  |  Conrad   |    002       | Maintenance
   3   |2000-01-01 00:10:00 | Mary |  Smith    |    003       | Multitasking
   4   |2000-01-01 00:09:30 | John |  Fischer  |    001       | Maintenance
   5   |2000-01-01 00:15:30 | John |  Fischer  |    001       | Maintenance
   6   |2000-01-02 00:04:30 | Bob  |  Conrad   |    002       | Maintenance
   7   |2000-01-02 00:10:00 | Mary |  Smith    |    003       | Multitasking
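
To experiment with the question, the table above can be rebuilt as a DataFrame. This is only a sketch: the values are copied from the table, and storing `Employee_ID` as a string (to keep the leading zeros) is an assumption about the original data.

```python
import pandas as pd

# Values copied from the table above. Employee_ID is stored as a string
# (an assumption) so that the leading zeros are preserved.
df = pd.DataFrame(
    {
        "Entrance time": pd.to_datetime(
            [
                "2000-01-01 00:00:00",
                "2000-01-01 00:04:30",
                "2000-01-01 00:04:30",
                "2000-01-01 00:10:00",
                "2000-01-01 00:09:30",
                "2000-01-01 00:15:30",
                "2000-01-02 00:04:30",
                "2000-01-02 00:10:00",
            ]
        ),
        "Name": ["John", "John", "Bob", "Mary", "John", "John", "Bob", "Mary"],
        "Last name": ["Fischer", "Fischer", "Conrad", "Smith",
                      "Fischer", "Fischer", "Conrad", "Smith"],
        "Employee_ID": ["001", "001", "002", "003", "001", "001", "002", "003"],
        "Task": ["Maintenance", "Development", "Maintenance", "Multitasking",
                 "Maintenance", "Maintenance", "Maintenance", "Multitasking"],
    }
)

print(df)
```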

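I would then like to remove duplicates: a row should be dropped if its entrance time is less than 10 minutes away from that of another row with the same name and the same task. The resulting DataFrame should therefore be: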

 index |    Entrance time   | Name | Last name | Employee_ID  | Task
 --------------------------------------------------------------------
   0   |2000-01-01 00:00:00 | John |  Fischer  |    001       | Maintenance
   1   |2000-01-01 00:04:30 | John |  Fischer  |    001       | Development
   2   |2000-01-01 00:04:30 | Bob  |  Conrad   |    002       | Maintenance
   3   |2000-01-01 00:10:00 | Mary |  Smith    |    003       | Multitasking
   5   |2000-01-01 00:15:30 | John |  Fischer  |    001       | Maintenance
   6   |2000-01-02 00:04:30 | Bob  |  Conrad   |    002       | Maintenance
   7   |2000-01-02 00:10:00 | Mary |  Smith    |    003       | Multitasking

I used drop_duplicates(subset=["Name", "Last name", "Task"]), but I don't know how to apply the time condition that compares each row with the rest.
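
For reference, here is a minimal sketch of what that call does on its own, using the sample rows from the question. It ignores the time column entirely, so it also drops rows 5, 6 and 7, which the desired result keeps.

```python
import pandas as pd

# Sample rows from the question. Timestamps can stay as strings here,
# because this call never looks at the time column.
df = pd.DataFrame(
    [
        ("2000-01-01 00:00:00", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-01 00:04:30", "John", "Fischer", "001", "Development"),
        ("2000-01-01 00:04:30", "Bob", "Conrad", "002", "Maintenance"),
        ("2000-01-01 00:10:00", "Mary", "Smith", "003", "Multitasking"),
        ("2000-01-01 00:09:30", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-01 00:15:30", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-02 00:04:30", "Bob", "Conrad", "002", "Maintenance"),
        ("2000-01-02 00:10:00", "Mary", "Smith", "003", "Multitasking"),
    ],
    columns=["Entrance time", "Name", "Last name", "Employee_ID", "Task"],
)

# Keeps only the first row for each (Name, Last name, Task) combination,
# regardless of how far apart the timestamps are.
deduped = df.drop_duplicates(subset=["Name", "Last name", "Task"])

print(deduped.index.tolist())  # [0, 1, 2, 3]
```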

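I hope you can help me. Thanks in advance.

Computing the time difference between consecutive rows may help you. However, you will still need to apply your own condition to the duplicate cases.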

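import numpy as np

# Sort so that rows sharing ["Name", "Last name", "Task"] are adjacent
# and ordered by time within each group.
# Assumes 'Entrance time' already has a datetime64 dtype.
df.sort_values(["Name", "Last name", "Task", "Entrance time"], inplace=True)

# Time difference between each row and the row above it.
# Note: the first row of each group is compared with the last row of the
# previous group, so that value is not meaningful for the duplicate check.
temp = df['Entrance time'] - df['Entrance time'].shift()

# Convert the absolute difference to minutes
df['diff_mins'] = temp.abs() / np.timedelta64(1, 'm')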

Output:

   index  Entrance time        Name  Last name  Employee_ID  Task          diff_mins
2  2      2000-01-01 00:04:30  Bob   Conrad     2            Maintenance   nan
6  6      2000-01-02 00:04:30  Bob   Conrad     2            Maintenance   1440
1  1      2000-01-01 00:04:30  John  Fischer    1            Development   1440
0  0      2000-01-01 00:00:00  John  Fischer    1            Maintenance   4.5
4  4      2000-01-01 00:09:30  John  Fischer    1            Maintenance   9.5
5  5      2000-01-01 00:15:30  John  Fischer    1            Maintenance   6
3  3      2000-01-01 00:10:00  Mary  Smith      3            Multitasking  5.5
7  7      2000-01-02 00:10:00  Mary  Smith      3            Multitasking  1440
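
One possible way to apply the condition is sketched below. It is not part of the answer above and rests on an assumption: each row is compared with the last row that was kept in its (Name, Last name, Task) group, not with the row directly before it. That reading is the one that reproduces the expected table, where row 5 survives although it is only 6 minutes after the dropped row 4. The helper name `rows_to_keep` and its `min_gap` parameter are hypothetical.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame(
    [
        ("2000-01-01 00:00:00", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-01 00:04:30", "John", "Fischer", "001", "Development"),
        ("2000-01-01 00:04:30", "Bob", "Conrad", "002", "Maintenance"),
        ("2000-01-01 00:10:00", "Mary", "Smith", "003", "Multitasking"),
        ("2000-01-01 00:09:30", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-01 00:15:30", "John", "Fischer", "001", "Maintenance"),
        ("2000-01-02 00:04:30", "Bob", "Conrad", "002", "Maintenance"),
        ("2000-01-02 00:10:00", "Mary", "Smith", "003", "Multitasking"),
    ],
    columns=["Entrance time", "Name", "Last name", "Employee_ID", "Task"],
)
df["Entrance time"] = pd.to_datetime(df["Entrance time"])


def rows_to_keep(times, min_gap=pd.Timedelta(minutes=10)):
    """Return the index labels of rows that are at least `min_gap`
    later than the previously kept row (hypothetical helper)."""
    kept = []
    last_kept = None
    # Walk through the group's timestamps in chronological order.
    for label, ts in times.sort_values().items():
        if last_kept is None or ts - last_kept >= min_gap:
            kept.append(label)
            last_kept = ts
    return kept


# Apply the rule separately to each (Name, Last name, Task) group.
keep = []
for _, group in df.groupby(["Name", "Last name", "Task"]):
    keep.extend(rows_to_keep(group["Entrance time"]))

# Restore the original row order.
result = df.loc[sorted(keep)]

print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7]
```

The explicit loop is used because the decision for each row depends on which earlier rows were kept, which a single vectorised `shift` or `diff` comparison cannot express: filtering on the difference to the previous row alone would also drop row 5.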