Python: sum a column based on groupby and a condition
Tags: python, pandas, dataframe
I have a DataFrame with a few columns. I want to sum the "gap" column over the rows whose time falls within certain time slots.
region date time gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column per time slot. The slots are stored in a dict like this:
'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'
After summing, the DataFrame above should look like this:
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots, from 00:00:00 to 23:59:49. I tried this:
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
The idea is to convert the time column to datetimes, floor them to 10 minutes, then convert the result back to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Then aggregate with sum and, as a last step, map the slot names through the dictionary with its keys and values swapped:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
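Since the question mentions 144 slots, typing the dictionary by hand is impractical. The following is a minimal sketch of the same floor-then-group approach with the slot lookup generated programmatically. It assumes that slotN starts at (N - 1) * 10 minutes; the name `slot_by_start` is made up for this illustration and is not from the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'region': [1] * 8,
    'date': ['2016-01-01'] * 8,
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Assumption: slotN starts at (N - 1) * 10 minutes, which gives 144 slots per day.
starts = pd.timedelta_range('00:00:00', periods=144, freq='10min')
# str(Timedelta) looks like '0 days 00:10:00'; the last 8 characters are HH:MM:SS.
slot_by_start = {str(t)[-8:]: f'slot{i}' for i, t in enumerate(starts, start=1)}

# Floor every time to the start of its 10-minute slot and keep it as text.
floored = pd.to_datetime(df['time'], format='%H:%M:%S').dt.floor('10min')
df['time'] = floored.dt.strftime('%H:%M:%S')

# Sum gap per region/date/slot, then append the slot name.
regres = df.groupby(['region', 'date', 'time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(slot_by_start)
print(regres)
```

Generating the lookup from `pd.timedelta_range` keeps the slot names and the flooring frequency in step: if the slot width changes, only the `freq` strings need to be edited.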
If you want to display the end of each slot (the next 10-minute boundary) instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and converting to strings is to bin the times with cut or searchsorted:
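import numpy as np
import pandas as pd
df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
# one label per bin: the HH:MM:SS start of each 10-minute slot
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]
# right=False / side='right' make the bins [start, end), so a time exactly
# on a boundary (e.g. 00:00:00) lands in the slot that starts there
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels, right=False)
df['time11'] = labels[np.searchsorted(bins, df['time'].values, side='right') - 1]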
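To see how the two binning calls treat times that fall exactly on a slot edge, here is a small self-contained sketch. It assumes half-open slots [start, end) are wanted, which matches what flooring does; the sample times are invented for the illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Invented sample times, including values exactly on slot edges.
times = pd.to_timedelta(
    pd.Series(['00:00:00', '00:00:08', '00:09:59', '00:10:00', '23:59:49'])
)

bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10min')  # 145 edges
labels = np.array([str(b)[-8:] for b in bins[:-1]])              # 144 slot starts

# right=False makes each bin [start, end), so an edge value belongs to
# the slot that starts at that edge.
by_cut = pd.cut(times, bins=bins, labels=labels, right=False)

# side='right' counts the edges <= each time; minus 1 is the slot index.
by_search = labels[bins.searchsorted(times.values, side='right') - 1]

print(by_cut.astype(str).tolist())
print(by_search.tolist())
```

With these settings both calls place 00:00:00 in the 00:00:00 slot and 00:10:00 in the 00:10:00 slot. With the default arguments (`right=True`, `side='left'`) a time exactly on an edge would be assigned to the preceding slot, and 00:00:00 would not fall in any bin.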
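One way to solve this is to first convert the time column to the values you want, then run a groupby sum on the time column.
The code below shows the approach I used. I used np.select so that I could include as many conditions and choices as I wanted. Once time had been converted to the values I wanted, a simple groupby sum did the rest.
There is no real need to go through the trouble of formatting times or converting strings: zero-padded HH:MM:SS strings compare in chronological order, so pandas can compare them directly.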
import pandas as pd
import numpy as np  # needed for np.select
# Create the DataFrame from a dictionary
regdict = {
'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
'gap': [1,0,1,0,0,1,0,0],}
df = pd.DataFrame(regdict)
# List every condition and its matching choice; the first condition that is True wins
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']
# Apply np.select once the conditions and choices are defined;
# a string default keeps the result dtype consistent for rows that match no condition
answerlist = np.select(condlist, choicelist, default='')
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
time gap
0 00:10:00/slot1 1
1 00:10:00/slot1 0
2 00:10:00/slot1 1
3 00:10:00/slot1 0
4 00:20:00/slot2 0
5 00:20:00/slot2 1
6 00:20:00/slot2 0
7 00:20:00/slot2 0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
time gap
0 00:10:00/slot1 2
1 00:20:00/slot2 1
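To avoid the complexity of datetime comparisons (unless that is your whole point, in which case ignore my answer) and to show the essence of this group-by-slot-window problem, I assume here that the times are integers.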
df = pd.DataFrame({'time':[8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
df['slot'] = df.apply(func = lambda x: slots[np.argmax(slots[x['time']>slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output:
gap
slot
-----------
0 2
1000 1
1500 3
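The row-wise `apply` above calls a Python function once per row, which can be slow on large frames. A vectorized sketch of the same slot assignment using `np.searchsorted` follows. It assumes `slots` is sorted in ascending order and that no time is smaller than the first slot start. One difference from the `apply` version: a time exactly equal to a slot start (for example 1000) is assigned to the slot that starts there, whereas the strict `>` comparison above would put it in the previous slot.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])  # slot start values, sorted ascending

# side='right' counts the slot starts <= each time; subtracting 1 gives
# the index of the latest slot that starts at or before that time.
idx = np.searchsorted(slots, df['time'].to_numpy(), side='right') - 1
df['slot'] = slots[idx]

result = df.groupby('slot')[['gap']].sum()
print(result)
```

On the sample data this produces the same per-slot sums as the output shown above.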
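Comments:
Is this a slow process? It takes too much time on my data, which has more than 600k records.
@shahidhamdam - I need a bit more time, but it can probably be made faster.
@shahidhamdam - One thing: do you need the second solution or the first one?
I need the first solution, but there is one more thing. I also want to count the number of rows in each time slot. For example, slot1 has 4 rows. Can you help?
@shahidhamdam - Use regres = df.groupby(['region', 'date', 'time', 'time1'], as_index=False).size().reset_index(name='count')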
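For the follow-up request in the comments (counting rows per slot alongside the sum), one possible sketch uses named aggregation to get both numbers in a single pass. It assumes a pandas version that supports named aggregation (0.25 or later) and reuses the sample data from the question. As far as I can tell, in recent pandas versions `size()` on a groupby created with `as_index=False` already returns a DataFrame, so the one-liner from the comment may need adjusting there; the sketch avoids that combination.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'region': [1] * 8,
    'date': ['2016-01-01'] * 8,
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Slot start (time) and slot end (time1), both as HH:MM:SS text.
times = pd.to_datetime(df['time'], format='%H:%M:%S').dt.floor('10min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = (times + pd.Timedelta('10min')).dt.strftime('%H:%M:%S')

# Named aggregation: sum of gap and number of rows for every slot.
regres = (df.groupby(['region', 'date', 'time', 'time1'])
            .agg(gap=('gap', 'sum'), count=('gap', 'size'))
            .reset_index())
print(regres)
```

For the sample data each of the two slots contains 4 rows, which matches the count the asker expected for slot1.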