Python: how to count concurrent events in a dataframe in one line?

Tags: python, python-3.x, datetime, pandas, conditional-statements

I have a dataset of phone calls. For each record I want to count how many calls are active at that moment. I found this, but I would like to avoid loops and functions.

Each call has a date, a start time and an end time.

The dataframe:

      start       end        date
0  09:17:12  09:18:20  2016-08-10
1  09:15:58  09:17:42  2016-08-11
2  09:16:40  09:17:49  2016-08-11
3  09:17:05  09:18:03  2016-08-11
4  09:18:22  09:18:30  2016-08-11
What I want:

      start       end        date  activecalls
0  09:17:12  09:18:20  2016-08-10            1
1  09:15:58  09:17:42  2016-08-11            1
2  09:16:40  09:17:49  2016-08-11            2
3  09:17:05  09:18:03  2016-08-11            3
4  09:18:22  09:18:30  2016-08-11            1
My code:

import pandas as pd

df = pd.read_clipboard(sep='\s\s+')

df['activecalls'] = df[(df['start'] <= df.loc[df.index]['start']) & \
                       (df['end'] > df.loc[df.index]['start']) & \
                       (df['date'] == df.loc[df.index]['date'])].count()

print(df)
You can use:

#convert time and date to datetime
df['date_start'] = pd.to_datetime(df.start + ' ' + df.date)
df['date_end'] = pd.to_datetime(df.end + ' ' + df.date)
#remove columns
df = df.drop(['start','end','date'], axis=1)
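As a side note (a sketch of my own, not from the original answer): before the drop, the same timestamps can also be built by parsing the date and adding the time of day as a timedelta.

# hypothetical alternative, using the original string columns before they are dropped
df['date_start'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['start'])
df['date_end'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['end'])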
Solution with a loop:

active_events= []
for i in df.index:
    active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & 
                                (df["date_end"]> df.loc[i,"date_start"])]))
df['activecalls'] = pd.Series(active_events)
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1
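Since the question asks for a one-liner, the same loop logic can be squeezed into a single apply call (a sketch I am adding, not part of the original answer); it still scans the whole frame once per row, so it is not faster than the loop:

# hypothetical one-line variant of the loop above (still O(n^2) comparisons)
df['activecalls'] = df['date_start'].apply(lambda s: ((df['date_start'] <= s) & (df['date_end'] > s)).sum())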
Timings:

def a(df):
    active_events= []
    for i in df.index:
        active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & (df["date_end"]> df.loc[i,"date_start"])]))
    df['activecalls'] = pd.Series(active_events)
    return (df)

def b(df):
    df['tmp'] = 1
    df1 = pd.merge(df,df.reset_index(),on=['tmp'])
    df = df.drop('tmp', axis=1)
    df1 = df1[(df1["date_start_x"]<=df1["date_start_y"])  & (df1["date_end_x"]> df1["date_start_y"])]
    df['activecalls'] = df1.groupby('index').size()
    return (df)

print (a(df))
print (b(df))

In [160]: %timeit (a(df))
100 loops, best of 3: 6.76 ms per loop

In [161]: %timeit (b(df))
The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.61 ms per loop

Please take a look at @jezrael's answer.

Thank you for the recommendation. I was hoping to do it in one line. I tried the loop solution and it worked, thanks for your help. Do you know why my code doesn't work? I also tried df['activecalls'] = len(df[…]), but that didn't work either.

I think this is a complex problem, so the solution is also complex. df[(df['start'] <= df.loc[df.index]['start']) & (df['end'] > df.loc[df.index]['start']) & (df['date'] == df.loc[df.index]['date'])].count() doesn't work because you need to compare each value of the start column against all values of start and all values of end. That is why I used a merge instead of a loop, although sometimes a solution with a loop is simpler and faster. I added timings, and the second merge solution is faster.

Apparently I have some misunderstanding about dataframes. I'll have to look into it. Thanks for your help.

Glad to help, have a nice day!
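To illustrate why the original attempt fails (a small sketch I added, run on the dataframe before start/end/date were combined into datetimes): df.loc[df.index] is just the whole frame again, so the comparison is row-aligned instead of comparing every row against every other row, and .count() counts non-null values per column rather than matches per row.

mask = df['start'] <= df.loc[df.index]['start']   # element-wise: row i vs row i, always True
print(mask.all())                                 # True
print(df[mask].count())                           # one count per column, not per row

The merge solution mentioned above is broken down step by step below.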
#cross join
df['tmp'] = 1
df1 = pd.merge(df,df.reset_index(),on=['tmp'])
df = df.drop('tmp', axis=1)
#print (df1)

#filtering by conditions
df1 = df1[(df1["date_start_x"]<=df1["date_start_y"]) &
          (df1["date_end_x"]> df1["date_start_y"])]
print (df1)
          date_start_x          date_end_x  activecalls_x  tmp  index  \
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1    1      0   
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      1   
7  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      2   
8  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      3   
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      2   
13 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      3   
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3    1      3   
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1    1      4   

          date_start_y          date_end_y  activecalls_y  
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1  
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1  
7  2016-08-11 09:16:40 2016-08-11 09:17:49              2  
8  2016-08-11 09:17:05 2016-08-11 09:18:03              3  
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2  
13 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1  
#get size - active calls
print (df1.groupby(['index'], sort=False).size())
index
0    1
1    1
2    2
3    3
4    1
dtype: int64

df['activecalls'] = df1.groupby('index').size()
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1
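For completeness, the same cross-comparison can also be sketched with NumPy broadcasting instead of a merge (my addition, not part of the original answer); it skips the cross-join DataFrame but still builds an n-by-n boolean matrix, so it is only practical for moderately sized frames.

import numpy as np

# cell [i, j] is True when call j is already running at the moment call i starts
s = df['date_start'].to_numpy()
e = df['date_end'].to_numpy()
active = (s[None, :] <= s[:, None]) & (e[None, :] > s[:, None])
df['activecalls'] = active.sum(axis=1)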