Python: aggregating events with start and end times using Pandas
I have data on a number of events, each with a start and end time, like this:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
Now I need to calculate the number of events that are active at the same time and, for example, the sum of their values. So the result should look like this:
date count sum
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-08 0 0
2015-01-09 0 0
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
Is there any way to do this? I thought about using a custom grouper for groupby, but as far as I can see a grouper can only assign each row to a single group, so it doesn't look useful here.
Edit: after some testing, I found this rather ugly way of getting the desired result:
df['count'] = 1
dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
start = df[['start', 'value', 'count']].set_index('start').reindex(dates)
end = df[['end', 'value', 'count']].set_index('end').reindex(dates).shift(1)
# pd.rolling_sum was removed in pandas 0.18; the .rolling accessor is the modern equivalent
rstart = start.rolling(len(start), min_periods=1).sum()
rend = end.rolling(len(end), min_periods=1).sum()
rstart.subtract(rend, fill_value=0).fillna(0)
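The rolling trick is effectively a sweep: add each event's value on its start date, subtract it on the day after its end date, and take a cumulative sum. A minimal sketch of that idea, using the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'],
                   'end': ['2015-01-07', '2015-01-15', '2015-01-13'],
                   'value': [3, 4, 5]})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
delta = pd.Series(0.0, index=dates)
for _, row in df.iterrows():
    delta[row['start']] += row['value']              # event switches on
    off = row['end'] + pd.Timedelta(days=1)
    if off in delta.index:
        delta[off] -= row['value']                   # switches off the day after it ends
running_sum = delta.cumsum()                         # per-day sum of active events' values
```

Replacing each event's value with 1 gives the count column the same way. Like the rolling version, though, this only works for additive aggregations.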
However, this only works for sums, and I don't see an obvious way to make it work for other functions. For example, is there a way to make it use the median instead of the sum?

This is what I came up with; I imagine there is a better way. Given your frame
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
Then
dList = []
vList = []
d = {}

def buildDict(row):
    for x in pd.date_range(row["start"], row["end"]):  # build a range for each row
        dList.append(x)           # date list
        vList.append(row["value"])  # value list

df.apply(buildDict, axis=1)  # each row in df is passed to buildDict

# this d will be used to create our new frame
d["date"] = dList
d["value"] = vList

# from here you can use whatever agg functions you want
pd.DataFrame(d).groupby("date").agg(["count", "sum"])
yielding
value
count sum
date
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
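On newer pandas (0.25+), the same per-row date expansion can be written without the global lists by building a date column and calling `explode`; a hedged sketch with the example data, which also extends naturally to other aggregations such as the median:

```python
import pandas as pd

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'],
                   'end': ['2015-01-07', '2015-01-15', '2015-01-13'],
                   'value': [3, 4, 5]})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# one row per (event, active day), then group by day
expanded = (df.assign(date=df.apply(lambda r: list(pd.date_range(r['start'], r['end'])), axis=1))
              .explode('date'))
result = expanded.groupby('date')['value'].agg(['count', 'sum', 'median'])
```

Note that, like the answer above, this only emits days on which at least one event is active; a `reindex` over the full date range is still needed to get the zero rows.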
If I were using SQL, I would do this by joining an all-dates table to the events table and then grouping by date. Pandas doesn't make this approach particularly easy, since there is no way to left join on a condition, but we can fake it with a dummy column and reindexing:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df['dummy'] = 1
Then:

date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
final = (
    date_df
    .merge(df, on='dummy')
    .ply_where(X.start <= X.date, X.date <= X.end)
    .groupby('date')
    .ply_select(val_count=X.size(), val_sum=X.value.sum(), median=X.value.median())
    .reindex(date_series)
    .ply_select('*', val_count=X.val_count.fillna(0), val_sum=X.val_sum.fillna(0))
    .reset_index()
)

(`ply_where`, `ply_select`, and `X` come from the pandas-ply extension, which has to be installed and activated separately.)

It handles nulls better. Reminds me of counting windings or open/close delimiters, though it's not clear how to port that algorithm. Nice, thanks! That's a clever way of constructing a join table from a condition; I'll have to test it with some real data to see how it performs on large tables.
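The `ply_where`/`ply_select` pipeline depends on the pandas-ply extension; a rough equivalent in plain pandas (cross join via the dummy column, boolean filter, then group, aggregate, and reindex — column names here just mirror the ply version) might look like:

```python
import pandas as pd

df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'],
                   'end': ['2015-01-07', '2015-01-15', '2015-01-13'],
                   'value': [3, 4, 5]})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['dummy'] = 1

date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))

joined = date_df.merge(df, on='dummy')  # cross join: every date paired with every event
active = joined[(joined['start'] <= joined['date']) & (joined['date'] <= joined['end'])]
final = (active.groupby('date')['value']
               .agg(val_count='count', val_sum='sum', median='median')
               .reindex(date_series))                  # restore days with no active events
final[['val_count', 'val_sum']] = final[['val_count', 'val_sum']].fillna(0)
```

The cross join materializes len(dates) × len(events) rows before filtering, so, as the comments note, performance on large tables is worth measuring.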