How can I transform a dataframe to get the times at which various events occurred?

Given the following dataframe:
+-------+-----+-------+-----+--------+---------------------------+
| DID | CID | Event | OID | Source | TimeStamp |
+-------+-----+-------+-----+--------+---------------------------+
| 25078 | 14 | QBT | 0 | EMS | 2019-10-15 10:54:35 +0000 |
| 25078 | 14 | NDOBT | 0 | EMS | 2019-10-15 10:54:48 +0000 |
| 25078 | 14 | SBT | 0 | EMS | 2019-10-15 10:54:52 +0000 |
| 25078 | 14 | SBT-1 | 0 | ECS | 2019-10-15 11:00:01 +0000 |
| 25078 | 14 | SBT-1 | 0 | ECS | 2019-10-15 11:00:26 +0000 |
| 25078 | 14 | SBT-1 | 0 | ECS | 2019-10-15 11:00:50 +0000 |
| 25078 | 14 | SBT | 0 | EMS | 2019-10-15T14:27:45 |
| 25078 | 14 | SBT | 0 | EMS | 2019-10-15T14:27:45 |
| 25078 | 14 | LSFA | 0 | SPDLS | 2019-10-15T14:28:16 |
| 25078 | 14 | LSFA | 0 | SPDLS | 2019-10-15T14:28:16 |
| 25078 | 14 | FEAR | 0 | CBS | 2019-10-15T14:28:18 |
| 25078 | 14 | FEAR | 0 | CBS | 2019-10-15T14:28:18 |
| 25078 | 14 | SBT | 0 | EMS | 2019-10-15T14:28:44 |
| 25078 | 14 | SBT | 0 | EMS | 2019-10-15T14:28:44 |
| 25078 | 14 | LSFA | 0 | SPDLS | 2019-10-15T14:30:55 |
| 25078 | 14 | LSFA | 0 | SPDLS | 2019-10-15T14:30:55 |
| 25078 | 14 | SBT | 0 | EMS-1 | 2019-10-15T15:28:43 |
| 25078 | 14 | SBT | 0 | EMS-1 | 2019-10-15T15:29:02 |
| 25078 | 14 | FEAR | 0 | CBS | 2019-10-15T15:30:51 |
| 25078 | 14 | FEAR | 0 | CBS | 2019-10-15T15:30:51 |
| 25078 | 14 | DBT | 0 | RS | 2019-10-15T15:44:23 |
| 25078 | 14 | QBT | 0 | EMS-1 | 2019-10-15T16:02:16 |
+-------+-----+-------+-----+--------+---------------------------+
I would like to get the first and last occurrence of certain events and sources, so that the final output looks like this:
+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+
| DID | CID | Event-QBT-Last-DT | Event-QBT-First-DT | Event-SBT-Last-DT | Event-SBT-First-DT | Screen-ECS-First-DT | Screen-ECS-Last-DT | FirstTimeUsage | LastTime Usage |
+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+
| 25078 | 14 | 2019-10-15T16:02:16 | 10/15/19 10:54 AM | 2019-10-15T15:29:02 | 10/15/19 10:54 AM | 2019-10-15 11:00:01 +0000 | 2019-10-15 11:00:50 +0000 | 2019-10-15 10:54:35 +0000 | 2019-10-15T16:02:16 |
+-------+-----+---------------------+--------------------+---------------------+--------------------+---------------------------+---------------------------+---------------------------+---------------------+
How can I achieve this using pandas?

The idea is to filter the rows for the events of interest, aggregate per group with first and last, reshape with unstack, and finally flatten the resulting MultiIndex in the columns:
L = ['QBT','SBT']
df1 = (df[df['Event'].isin(L)]
.groupby(['OID','DID','CID','Event'])['TimeStamp']
.agg([('Last-DT','last'), ('First-DT','first')])
.unstack()
.sort_index(axis=1, level=1))
df1.columns = [f'Event-{b}-{a}' for a, b in df1.columns]
#print (df1)
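As a minimal, self-contained illustration of the filter → groupby → agg → unstack → flatten pattern (using a tiny hypothetical frame, not the data above):

```python
import pandas as pd

# Hypothetical miniature frame: two event types within one group
df = pd.DataFrame({
    'OID': [0, 0, 0, 0],
    'DID': [1, 1, 1, 1],
    'CID': [9, 9, 9, 9],
    'Event': ['QBT', 'QBT', 'SBT', 'SBT'],
    'TimeStamp': ['t1', 't2', 't3', 't4'],
})

L = ['QBT', 'SBT']
out = (df[df['Event'].isin(L)]
         .groupby(['OID', 'DID', 'CID', 'Event'])['TimeStamp']
         .agg([('Last-DT', 'last'), ('First-DT', 'first')])
         .unstack()
         .sort_index(axis=1, level=1))
# columns are now a MultiIndex like ('First-DT', 'QBT'); flatten them
out.columns = [f'Event-{b}-{a}' for a, b in out.columns]
print(out)
```

The list-of-tuples passed to agg names each output column, unstack moves the Event level into the columns, and the comprehension flattens the two column levels into single strings.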
For the overall first and last occurrence, use the same approach without the filter, without Event in the groupby, and without unstack:
df2 = (df.groupby(['OID','DID','CID'])['TimeStamp']
.agg([('FirstTimeUsage','first'), ('LastTime Usage','last')]))
#print (df2)
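The same rename-while-aggregating trick on a tiny hypothetical frame; with modern pandas, keyword (named) aggregation is an equivalent alternative, and dict unpacking allows output names containing spaces:

```python
import pandas as pd

df = pd.DataFrame({'OID': [0, 0], 'DID': [1, 1], 'CID': [9, 9],
                   'TimeStamp': ['t1', 't2']})

# list-of-tuples renaming, as in the answer
df2 = (df.groupby(['OID', 'DID', 'CID'])['TimeStamp']
         .agg([('FirstTimeUsage', 'first'), ('LastTime Usage', 'last')]))

# equivalent via named aggregation (dict unpacked so a name may contain a space)
df2b = (df.groupby(['OID', 'DID', 'CID'])['TimeStamp']
          .agg(**{'FirstTimeUsage': 'first', 'LastTime Usage': 'last'}))
print(df2)
```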
Finally, join both together:
df = df1.join(df2).reset_index()
print (df)
OID DID CID Event-QBT-First-DT Event-QBT-Last-DT \
0 0 25078 14 2019-10-15 10:54:35 +0000 2019-10-15T16:02:16
Event-SBT-First-DT Event-SBT-Last-DT FirstTimeUsage \
0 2019-10-15 10:54:52 +0000 2019-10-15T15:29:02 2019-10-15 10:54:35 +0000
LastTime Usage
0 2019-10-15T16:02:16
EDIT: To handle the additional Screen columns, slightly modify how df1 is created:
L = ['QBT','SBT']
df1 = (df[df['Event'].isin(L)]
.groupby(['OID','DID','CID','Event'])['TimeStamp']
.agg([('Last-DT','last'), ('First-DT','first')])
.unstack()
.sort_index(axis=1, level=1))
df1.columns = [f'Event-{b}-{a}' for a, b in df1.columns]
#print (df1)
L2 = ['ECS']
df11 = (df[df['Source'].isin(L2)]
.groupby(['OID','DID','CID','Source'])['TimeStamp']
.agg([('Last-DT','last'), ('First-DT','first')])
.unstack()
.sort_index(axis=1, level=1))
df11.columns = [f'Screen-{b}-{a}' for a, b in df11.columns]
df2 = (df.groupby(['OID','DID','CID'])['TimeStamp']
.agg([('FirstTimeUsage','first'), ('LastTime Usage','last')]))
and at the end combine everything with concat:
df = pd.concat([df1, df11, df2], axis=1).reset_index()
print (df)
OID DID CID Event-QBT-First-DT Event-QBT-Last-DT \
0 0 25078 14 2019-10-15 10:54:35 +0000 2019-10-15T16:02:16
Event-SBT-First-DT Event-SBT-Last-DT Screen-ECS-First-DT \
0 2019-10-15 10:54:52 +0000 2019-10-15T15:29:02 2019-10-15 11:00:01 +0000
Screen-ECS-Last-DT FirstTimeUsage LastTime Usage
0 2019-10-15 11:00:50 +0000 2019-10-15 10:54:35 +0000 2019-10-15T16:02:16
You can also create a function and run it over the dataframe with apply; it produces the same output table shown in the question:
def f(x):
    # overall first/last usage within the group
    first_used_DT = x['TimeStamp'].min()
    last_used_DT = x['TimeStamp'].max()
    # per-source and per-event extremes
    first_ECS = x.loc[x['Source'] == 'ECS', 'TimeStamp'].min()
    last_ECS = x.loc[x['Source'] == 'ECS', 'TimeStamp'].max()
    first_QBT = x.loc[x['Event'] == 'QBT', 'TimeStamp'].min()
    last_QBT = x.loc[x['Event'] == 'QBT', 'TimeStamp'].max()
    first_SBT = x.loc[x['Event'] == 'SBT', 'TimeStamp'].min()
    last_SBT = x.loc[x['Event'] == 'SBT', 'TimeStamp'].max()
    return pd.DataFrame({'FirstTimeUsage': first_used_DT, 'LastTime Usage': last_used_DT,
                         'Screen-ECS-First-DT': first_ECS, 'Screen-ECS-Last-DT': last_ECS,
                         'Event-QBT-First-DT': first_QBT, 'Event-QBT-Last-DT': last_QBT,
                         'Event-SBT-First-DT': first_SBT, 'Event-SBT-Last-DT': last_SBT},
                        index=[0])
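The function is never actually invoked in the answer; a self-contained miniature of the apply call, with a reduced version of the function and a tiny hypothetical frame (the index cleanup at the end is one possible way to drop the dummy index=[0] level):

```python
import pandas as pd

def f(x):
    # reduced version: overall usage plus QBT extremes only
    first_used = x['TimeStamp'].min()
    last_used = x['TimeStamp'].max()
    first_qbt = x.loc[x['Event'] == 'QBT', 'TimeStamp'].min()
    last_qbt = x.loc[x['Event'] == 'QBT', 'TimeStamp'].max()
    return pd.DataFrame({'FirstTimeUsage': first_used,
                         'LastTime Usage': last_used,
                         'Event-QBT-First-DT': first_qbt,
                         'Event-QBT-Last-DT': last_qbt}, index=[0])

# hypothetical miniature frame
df = pd.DataFrame({
    'DID': [1, 1, 1],
    'CID': [9, 9, 9],
    'Event': ['QBT', 'SBT', 'QBT'],
    'TimeStamp': ['t1', 't2', 't3'],
})

result = (df.groupby(['DID', 'CID'])
            .apply(f)
            .reset_index(level=2, drop=True)   # drop the dummy index=[0] level
            .reset_index())
print(result)
```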
It may be slightly slower, but it gets the job done.

Comments:

- What have you tried so far?
- Thank you for the answer, but I have edited my question. I would also like to find an entry based on the Screen/Source column.
- Hi @jezrael, quick off-topic question: do you know how to flatten a MultiIndex into a single index as part of a method chain, other than via df.columns = ...?
- @MarkWang - hmm, I have not found a solution for that. I understand what is needed, but it does not seem to exist yet. Off the top of my head it should be a callable solution like df.set_axis, similar to assign.
- @mu I got it - groupby is needed because the function works per group. This solution was actually inspired by @jezrael's other posts.
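Regarding the flattening-inside-a-chain question in the comments: one possibility (a sketch, not the built-in callable API wished for above) is to route the frame through pipe, which keeps the set_axis rename inside the method chain:

```python
import pandas as pd

# hypothetical miniature frame
df = pd.DataFrame({'Event': ['QBT', 'QBT', 'SBT'],
                   'TimeStamp': ['t1', 't2', 't3'],
                   'DID': [1, 1, 1]})

out = (df.groupby(['DID', 'Event'])['TimeStamp']
         .agg(['first', 'last'])
         .unstack()                            # columns become a MultiIndex
         .pipe(lambda d: d.set_axis(           # flatten without leaving the chain
             [f'{a}-{b}' for a, b in d.columns], axis=1)))
print(out.columns.tolist())
```

pipe hands the intermediate frame to the lambda, so the flattened column labels can be computed from it mid-chain instead of assigning df.columns afterwards.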