Python: using pandas to turn the parameters of two DataFrames into new parameters
I have two DataFrames that both refer to the same events (marked by id). One df is discrete and shows the course of each event at some resolution over several months (df shows only an excerpt here); the other df summarizes the parameters of each event (df_event).
Simplified data:
df (the original df has more rows!)
Output:
                     id                date  numb
2020-01-01 12:00:00   1 2020-01-01 12:00:00     1
2020-01-01 13:00:00   1 2020-01-01 12:00:00     5
2020-01-01 14:00:00   1 2020-01-01 12:00:00     8
2020-01-01 15:00:00   2 2020-01-05 15:00:00     0
2020-01-01 16:00:00   2 2020-01-05 15:00:00     4
2020-01-01 17:00:00   2 2020-01-05 15:00:00    11
2020-01-01 18:00:00   2 2020-01-05 15:00:00    25
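The code that builds df is not shown in the post. A minimal reconstruction from the printed excerpt might look like the sketch below. The column names and values are taken from the output; using the hourly timestamps as the row index is an assumption based on how the table prints.

```python
import pandas as pd

# Hypothetical reconstruction of the excerpt; the real df has more rows.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2, 2],
        "date": ["2020-01-01 12:00:00"] * 3 + ["2020-01-05 15:00:00"] * 4,
        "numb": [1, 5, 8, 0, 4, 11, 25],
    },
    # Assumed: the unnamed first column of the printout is an hourly DatetimeIndex.
    index=pd.date_range("2020-01-01 12:00:00", periods=7, freq="h"),
)
df["date"] = pd.to_datetime(df["date"])
print(df)
```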
df_event:
import pandas as pd

df_event = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'date': ['2020-01-01 12:00:00', '2020-01-01 15:00:00', '2020-01-08 07:00:00', '2020-01-15 13:00:00', '2020-01-22 12:00:00'],
                         'numb_total': [8, 25, 11, 14, 8],
                         'timedelta': [55, 60, 45, 15, 30]})  # durations in minutes
df_event = df_event.set_index('id')
df_event['date'] = pd.to_datetime(df_event['date'])
# 'min' is used instead of the deprecated 'T' alias for minutes
df_event['timedelta'] = pd.to_timedelta(df_event['timedelta'], unit='min')
Output:
                  date  numb_total timedelta
id
1  2020-01-01 12:00:00           8  00:55:00
2  2020-01-01 15:00:00          25  01:00:00
3  2020-01-08 07:00:00          11  00:45:00
4  2020-01-15 13:00:00          14  00:15:00
5  2020-01-22 12:00:00           8  00:30:00
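Now I would like to link the two dfs so that I get a daily/weekly profile. The resulting df should be sorted by day and hour, and for each such period it should show the mean of numb and timedelta.
The weekly profile should show which numb and timedelta (from df_event) are the average at the respective moment (= day + time). The minimum and maximum at each moment would also be interesting.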
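The "mean, minimum and maximum per day + time" requirement can be sketched with a single groupby and a list of aggregations. This is only an illustration under assumptions: the frame below is a stand-in built from the df_event values in the question, and the column name minutes is hypothetical (the duration is kept as plain minutes so that all three statistics work on any pandas version).

```python
import datetime

import pandas as pd

# Stand-in for df_event with the values from the question.
df_event = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-01-01 12:00:00",
                "2020-01-01 15:00:00",
                "2020-01-08 07:00:00",
                "2020-01-15 13:00:00",
                "2020-01-22 12:00:00",
            ]
        ),
        "numb_total": [8, 25, 11, 14, 8],
        "minutes": [55, 60, 45, 15, 30],  # hypothetical name: timedelta as minutes
    },
    index=pd.Index([1, 2, 3, 4, 5], name="id"),
)

# Split each timestamp into weekday name and time of day.
df_event["day"] = df_event["date"].dt.day_name()
df_event["time"] = df_event["date"].dt.time

# One row per (day, time); one column per (parameter, statistic).
profile = df_event.groupby(["day", "time"])[["numb_total", "minutes"]].agg(
    ["mean", "min", "max"]
)
print(profile)
```

The result has two column levels, so a single statistic can be picked with, for example, profile[("minutes", "max")].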
For example, create a new df2 like this:
df['day'] = df['date'].dt.day_name()
df['time'] = df['date'].dt.time
df_event = df.groupby(['day', 'time'])...
and then add the data from df_event to get a result like this:
                    timedelta  numb_total
day       time
Monday    00:00:00   00:00:00           0
Monday    01:00:00   00:00:00           0
...
Wednesday 11:00:00   00:00:00           0
Wednesday 12:00:00   00:55:00           8
...
Sunday    14:00:00   00:00:00           0
Sunday    15:00:00   01:00:00          25
Sunday    16:00:00   00:00:00           0
...
Sunday    23:00:00   00:00:00           0
IIUC, first aggregate both DataFrames, then join them together:
# Start from the raw df_event, where 'id' is still a column and 'timedelta' is still numeric minutes
df_event = df_event.set_index('id')
df_event['date'] = pd.to_datetime(df_event['date'])
df_event['day'] = df_event['date'].dt.day_name()
df_event['time'] = df_event['date'].dt.time
df_event1 = df_event.groupby(['day', 'time'])[['timedelta', 'numb_total']].mean()
print(df_event1)
                    timedelta  numb_total
day       time
Wednesday 07:00:00       45.0        11.0
          12:00:00       42.5         8.0
          13:00:00       15.0        14.0
          15:00:00       60.0        25.0
df['day'] = df['date'].dt.day_name()
df['time'] = df['date'].dt.time
df_event2 = df.groupby(['day', 'time'])['numb'].mean()
print (df_event2)
day        time
Sunday     15:00:00    10.000000
Wednesday  12:00:00     4.666667
Name: numb, dtype: float64
df = df_event1.join(df_event2, how='inner')
# Convert the averaged minutes back to a timedelta ('min' instead of the deprecated 'T')
df['timedelta'] = pd.to_timedelta(df['timedelta'], unit='min')
print(df)
                          timedelta  numb_total      numb
day       time
Wednesday 12:00:00  0 days 00:42:30         8.0  4.666667
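The inner join keeps only the (day, time) slots that occur in both frames, while the desired output in the question lists every hour of every weekday and fills empty slots with zeros. One way to get that layout is to reindex against a full day-by-hour grid. This is a sketch, not part of the answer; the one-row profile frame below is a hypothetical stand-in for the joined result.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for the joined result (one populated slot).
profile = pd.DataFrame(
    {
        "timedelta": pd.to_timedelta([42.5], unit="min"),
        "numb_total": [8.0],
        "numb": [4.666667],
    },
    index=pd.MultiIndex.from_tuples(
        [("Wednesday", datetime.time(12, 0))], names=["day", "time"]
    ),
)

# Full weekly grid: 7 weekdays x 24 hourly slots, in calendar order.
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
hours = [datetime.time(h) for h in range(24)]
full_index = pd.MultiIndex.from_product([days, hours], names=["day", "time"])

# Slots without data become NaN/NaT after reindexing; fill them with zeros.
weekly = profile.reindex(full_index)
weekly["timedelta"] = weekly["timedelta"].fillna(pd.Timedelta(0))
weekly[["numb_total", "numb"]] = weekly[["numb_total", "numb"]].fillna(0)
print(weekly.loc["Wednesday"].iloc[10:14])
```

Building the grid from an explicit weekday list also fixes the row order to Monday through Sunday, which an alphabetical sort of day names would not give.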
# What is the relationship between the index of df and its date column? Both are dates. Which one corresponds to the date in df_event?
Happy to revisit this once you clarify.
# Add an hour-of-day key to each DataFrame, merge the two on key and date, then drop the columns that are not required
df2 = (pd.merge(df.assign(key=df.index.hour),
                df_event.assign(key=df_event.set_index('date').index.hour),
                on=['key', 'date'], how='right')
         .dropna()
         .drop_duplicates(keep='last')[['date', 'numb_total', 'timedelta']])
# Extract time and day name
df2['time'] = df2.date.dt.strftime('%H:%M:%S')
df2['day'] = df2.date.dt.day_name()
                 date  numb_total timedelta      time        day
0 2020-01-01 12:00:00           8  00:55:00  12:00:00  Wednesday
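Since the question states that both frames refer to the same events via id, another option (not taken by either answer) is to attach the event-level parameters by id instead of by date and hour. The sketch below assumes trimmed stand-ins for df and df_event that carry only the columns needed for the join.

```python
import pandas as pd

# Trimmed stand-ins for the two frames in the question.
df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 2, 2], "numb": [1, 5, 8, 0, 4, 11, 25]},
    index=pd.date_range("2020-01-01 12:00:00", periods=7, freq="h"),
)
df_event = pd.DataFrame(
    {"numb_total": [8, 25], "timedelta": pd.to_timedelta([55, 60], unit="min")},
    index=pd.Index([1, 2], name="id"),
)

# Match the 'id' column of df against the 'id' index of df_event.
# Every row of df receives the parameters of its event; the row index is kept.
linked = df.join(df_event, on="id")
print(linked)
```

From linked, the day and time columns can be derived from the index and grouped as in the first answer.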