Python: ranking times in a time-sorted DataFrame within a noise tolerance
I have a DataFrame df with three columns: Date, Time, and Name (there may be additional columns). df is sorted by Time in ascending order. On any given Date there can be multiple Time values, which may be within 5 minutes of each other or more than 15 minutes apart. Within a day, anything within 5 minutes should be treated as the same. I want to add a column TimeRank that, for each day, clusters Time values that fall within 5 minutes of each other and gives them the same TimeRank. For example:
   Date        Name      Time                 TimeRank
0  2017-01-01  Henry     2017-01-01 09:21:01  1
1  2017-01-01  John      2017-01-01 09:23:43  1
2  2017-01-01  Svetlana  2017-01-01 10:15:01  2
3  2017-01-01  Sara      2017-01-01 11:01:01  3
4  2017-01-01  Whitney   2017-01-01 11:03:03  3
5  2017-01-02  Lara      2017-01-02 11:03:03  1
6  2017-01-02  Eugene    2017-01-02 16:46:00  2
7  2017-01-02  Richard   2017-01-02 16:46:00  2
8  2017-01-03  Andy      2017-01-03 11:01:01  1
9  2017-01-03  Paul      2017-01-03 11:03:03  1
Below I create a sample df. Unfortunately, I have to use an older version of pandas, 0.16.
import pandas as pd
from random import randint
from datetime import time
dates = pd.date_range('2017-01-01', '2017-01-04')
dates2 = [dates[i] for i in [randint(0, len(dates) -1) for i in range (0, 100)]]
timelist = [time(9,20,45), time(9,21,0), time(9,23,43), time(9,50,0), time(10,15,1), time(11,1,1), time(11,3,3), time(16,45,0), time(16,46,0)]
timelist2 = [timelist[i] for i in [randint(0, len(timelist) -1) for i in range (0, 100)]]
names = ['henry', 'tom', 'andy', 'lara', 'whitney', 'eleanor', 'paloma', 'john', 'james', 'svetlana', 'paul']
names2 = [names[i] for i in [randint(0, len(names)-1) for i in range (0, 100)]]
df = pd.DataFrame({'Date':dates2, 'Time':timelist2, 'Name':names2})
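# Combine each row's Date and time-of-day into a full timestamp
df['Time'] = df.apply(lambda r:pd.datetime.combine(r['Date'],r['Time']), axis=1)
# pandas 0.16 API; later versions replaced DataFrame.sort with sort_values
df.sort('Time', inplace=True)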
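import numpy as np

# Minutes since midnight (seconds are ignored)
df.loc[:, 'minutes'] = df.apply(lambda x: x['Time'].minute + 60*x['Time'].hour, axis=1)
# Gap in minutes to the previous row on the same Date (NaN for the first row of each day)
df.loc[:, 'delTime'] = df.groupby('Date')['minutes'].diff()
# Rows within 5 minutes of the previous row continue its cluster
df.loc[(df['delTime'] <= 5) & (df['delTime'] >= -5), 'delTime'] = 0
# The first row of each day always starts a cluster
df.loc[np.isnan(df['delTime']), 'delTime'] = 1.
df.loc[(df['delTime']) == 0, 'delTime'] = np.nan
# Cluster starts keep their own minute value; ffill copies it to the rest of the cluster
df.loc[~np.isnan(df['delTime']), 'delTime'] = df['minutes']
df = df.ffill()
# Dense rank of the cluster value within each day
df.loc[:, 'TimeRank'] = df.groupby('Date')['delTime'].rank(method='dense')
df.drop(['minutes', 'delTime'], inplace=True, axis=1)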
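For comparison, here is a minimal sketch of the same idea written as a gap-and-cumulative-sum, using the rows from the example table in the question. It is not the code above: the helper name add_time_rank and its tolerance parameter are made up for illustration, and it relies on sort_values, so it assumes a pandas newer than the 0.16 mentioned in the question. It compares full timestamps (seconds included) rather than whole minutes, and, like the diff-based code above, it chains: each row is compared only with the row before it, so a long run of closely spaced rows ends up in one cluster.

```python
import pandas as pd


def add_time_rank(df, tolerance="5min"):
    """Return a sorted copy of df with a per-day TimeRank column.

    A row starts a new cluster when it is the first row of its Date or
    when its gap to the previous row on that Date exceeds `tolerance`.
    """
    out = df.sort_values(["Date", "Time"]).copy()
    # Gap to the previous row within the same day (NaT for the first row)
    gap = out.groupby("Date")["Time"].diff()
    new_cluster = gap.isna() | (gap > pd.Timedelta(tolerance))
    # Counting cluster starts within each day yields the rank directly
    out["TimeRank"] = new_cluster.astype(int).groupby(out["Date"]).cumsum()
    return out


# The ten rows from the example table in the question
example = pd.DataFrame({
    "Date": pd.to_datetime([
        "2017-01-01", "2017-01-01", "2017-01-01", "2017-01-01", "2017-01-01",
        "2017-01-02", "2017-01-02", "2017-01-02",
        "2017-01-03", "2017-01-03",
    ]),
    "Name": ["Henry", "John", "Svetlana", "Sara", "Whitney",
             "Lara", "Eugene", "Richard", "Andy", "Paul"],
    "Time": pd.to_datetime([
        "2017-01-01 09:21:01", "2017-01-01 09:23:43", "2017-01-01 10:15:01",
        "2017-01-01 11:01:01", "2017-01-01 11:03:03",
        "2017-01-02 11:03:03", "2017-01-02 16:46:00", "2017-01-02 16:46:00",
        "2017-01-03 11:01:01", "2017-01-03 11:03:03",
    ]),
})

result = add_time_rank(example)
print(result["TimeRank"].tolist())  # [1, 1, 2, 3, 3, 1, 2, 2, 1, 1]
```

Because the rank comes from counting cluster starts, TimeRank is an integer column here, whereas rank(method='dense') returns floats.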