Python Pandas: counting tasks across multiple X-minute time segments

Consider the following data:
Index Task Start Finish
0 RandomName 2018-10-15T13:30:00+00:00 2018-10-15T13:41:00+00:00
1 RandomName 2018-10-15T13:40:00+00:00 2018-10-15T13:51:00+00:00
2 RandomName 2018-10-15T13:50:00+00:00 2018-10-15T13:51:00+00:00
3 RandomName 2018-10-15T14:10:00+00:00 2018-10-15T14:11:00+00:00
4 RandomName 2018-10-15T14:20:00+00:00 2018-10-15T14:21:00+00:00
5 RandomName 2018-10-15T14:30:00+00:00 2018-10-15T14:31:00+00:00
What I'm trying to do is split this dataframe into 5-minute segments (time slots), count how many of these tasks occur in each segment, and visualize the result. Since the tasks have a duration, I first have to generate the segments, which I do as follows:
import pandas as pd
from datetime import datetime, timedelta

def main():
    input_file = "input.csv"
    df = pd.read_csv(
        input_file,
        parse_dates=['Start', 'Finish'],
        names=['Index', 'Job', 'Start', 'Finish'],
        index_col='Index',
        header=None,
    )
    # Find the duration of each task in minutes. Subtracting .dt.minute
    # breaks across hour boundaries, so use the full timedelta instead.
    df['Start'] = pd.to_datetime(df['Start'], dayfirst=True, errors='coerce')
    df['Finish'] = pd.to_datetime(df['Finish'], dayfirst=True, errors='coerce')
    df['Duration'] = (df['Finish'] - df['Start']).dt.total_seconds() / 60
    # Define the range and split it into 5-minute segments.
    rng_min = df['Start'].min()   # Earliest date
    rng_max = df['Finish'].max()  # Latest date
    current = rng_min
    while current < rng_max:
        current += timedelta(minutes=5)

if __name__ == "__main__":
    main()
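For reference, the boundary generation in the while-loop above can be done in a single pd.date_range call; a minimal sketch, with an inline frame standing in for the CSV (values taken from the sample data):

```python
import pandas as pd

# Stand-in for the CSV rows shown above.
df = pd.DataFrame({
    "Task": ["RandomName"] * 3,
    "Start": pd.to_datetime(["2018-10-15 13:30", "2018-10-15 13:40",
                             "2018-10-15 13:50"]),
    "Finish": pd.to_datetime(["2018-10-15 13:41", "2018-10-15 13:51",
                              "2018-10-15 13:51"]),
})

# Every 5-minute boundary from the earliest start up to the latest
# finish, replacing the explicit while-loop.
segments = pd.date_range(df["Start"].min(), df["Finish"].max(), freq="5min")
print(segments)
```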
You can do what you need on a time-series index using resample followed by reindex:
Resampling lets you change the frequency of a datetime index. In this case you want to "upsample", i.e. increase the number of steps in the data.
Reindex then lets you fill the remaining gaps with NAs.
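A minimal sketch of that resample-then-reindex pattern (the frame and grid here are illustrative, not the answer's actual code): each task's resample only covers its own time span, and reindex stretches it onto a shared 5-minute grid, exposing the missing slots as NaN.

```python
import pandas as pd

# Illustrative task start times; "A" runs twice, "B" once much later.
df = pd.DataFrame({
    "Task": ["A", "A", "B"],
    "Start": pd.to_datetime(["2018-10-15 13:30", "2018-10-15 13:40",
                             "2018-10-15 14:10"]),
})

# Shared 5-minute grid covering every task.
grid = pd.date_range(df["Start"].min(), df["Start"].max(), freq="5min")

# Per task, resample only spans that task's own window ("upsampling"),
# so reindex against the shared grid to expose missing slots as NaN.
per_task = (
    df.set_index("Start")
      .groupby("Task")["Task"]
      .apply(lambda s: s.resample("5min").count().reindex(grid))
)
print(per_task)
```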
pd.date_range is a useful way to segment the data. Use date_range with a 5-minute frequency to assign the segment times to a column. Then expand that column into a new dataframe using itertuples(), which iterates over each row of the frame. From there you can run a groupby on the data, or reshape it however you need.
df['Start'] = pd.to_datetime(df['Start'])
df['Finish'] = pd.to_datetime(df['Finish'])
# One 5-minute DatetimeIndex per row, spanning that task's lifetime.
df['Segments'] = df.index.map(lambda x: pd.date_range(start=df['Start'][x], end=df['Finish'][x], freq='5Min'))
# Expand into one (time, task) row per segment the task touches.
df1 = pd.DataFrame([(d, t.Task) for t in df.itertuples() for d in t.Segments])
df1 = df1.rename(columns={0: 'Time', 1: 'Task'})
grouped = df1.groupby(['Time'])
for time, group in grouped:
    print(group)
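If you want counts per slot rather than printed groups, size() collapses each group to a count; a small sketch with hand-built (Time, Task) pairs standing in for the expanded df1:

```python
import pandas as pd

# Hand-built (Time, Task) pairs, standing in for the expanded df1 above.
df1 = pd.DataFrame({
    "Time": pd.to_datetime(["2018-10-15 13:30", "2018-10-15 13:40",
                            "2018-10-15 13:40", "2018-10-15 13:45"]),
    "Task": ["RandomName0", "RandomName0", "RandomName1", "RandomName1"],
})

# size() turns each time-slot group into a simple running-task count.
counts = df1.groupby("Time").size()
print(counts)
```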
You could try something like this:
# Read the original dataframe from the clipboard buffer
df = pd.read_clipboard(index_col='Index')
df[['Start', 'Finish']] = df[['Start', 'Finish']].apply(pd.to_datetime)
# One 5-minute tick per task per row, stacked, then binned over the full range
df_out = df.apply(lambda x: pd.Series(pd.date_range(x.Start, x.Finish, freq='5T')), axis=1)\
           .stack()\
           .value_counts(bins=pd.date_range(df.Start.min(), df.Finish.max(), freq='5T'))\
           .sort_index()
df_out.index = pd.MultiIndex.from_tuples(df_out.index.to_tuples())
df_out = df_out.rename_axis(['Start', 'Finish']).rename('Task Running').reset_index()
print(df_out)
df_out.plot('Start', 'Task Running')
Output (note: inclusivity at the segment boundaries is ambiguous, i.e. whether a value at 13:35 belongs to the interval ending there or the one starting there):
Visualized output:
Comments:
What does your expected output look like? Add it to the post, thanks!
It could be done as follows: (a) create a dictionary whose keys are the start time, start + 5, start + 10 minutes, and so on until the start of the last record is covered. (b) Compare each task's finish time against each key; if it is greater than the key, append the task name to that key's list, so you end up with {'start_window1': ['T1', 'T2'], 'start_window2': ['T1', 'T2'], ...} where T1, T2 are task names. (c) Take the length of each key's list to get the answer you want.
Hi Katelie, thanks! I tried your code, but there is a bug: the segment duration changes from 5 minutes to 10 minutes (it does in the example in your answer as well).
My apologies, you need to use resample, and then reindex, to create all of the 5-minute intervals. I have edited my answer to correct it.
Thank you Katelie, you don't know how appreciated your help is! Unfortunately I still hit an error when I reach the resampling part: ValueError: cannot reindex from a duplicate axis [line 40], which is `.apply(lambda x: x.resample(rule='{interval}T'.format(interval=minutes_per_segment), label='right', closed='right'))`. Could you help me with this? Is it because the task names are not unique?
Hmm... do you have rows with identical start and finish times (to the minute, specifically)?
For me it reproduces with a combination of non-unique task names and outright duplicate rows, but not with finish time equal to start time. If seconds need to be taken into account, you may need to change the `TooTrxSt2` bit in the timedelta calculation. Trying the sample data again, I realized the file reads 'Job' rather than 'Task' as in the example, and the task-name creation got lost while I was editing. I also changed header=None to header=0 to account for the column labels in the sample data (otherwise it wouldn't read for me), but that may differ from your real data. I'd guess the column names may be off? That ValueError is likely an index or column issue.
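The dictionary approach outlined in the comments above ((a) window keys, (b) membership lists, (c) list lengths) could be sketched like this; the overlap test is my interpretation of step (b), and the data is illustrative:

```python
import pandas as pd
from datetime import timedelta

# Illustrative tasks; T1/T2 echo the names used in the comment.
df = pd.DataFrame({
    "Task": ["T1", "T2"],
    "Start": pd.to_datetime(["2018-10-15 13:30", "2018-10-15 13:40"]),
    "Finish": pd.to_datetime(["2018-10-15 13:41", "2018-10-15 13:51"]),
})

step = timedelta(minutes=5)
windows = {}                      # window start -> names of tasks running
w = df["Start"].min()
while w < df["Finish"].max():
    # A task overlaps [w, w + step) if it starts before the window
    # ends and finishes after the window begins.
    windows[w] = [t.Task for t in df.itertuples()
                  if t.Start < w + step and t.Finish > w]
    w += step

# (c) the list lengths are the per-window counts.
counts = {k: len(v) for k, v in windows.items()}
print(counts)
```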
Timeslot Start_Time End_Time Tasks_Running Task_Names
0 0 2018-10-15 13:30:00 2018-10-15 13:35:00 1.0 [RandomName0]
1 1 2018-10-15 13:35:00 2018-10-15 13:40:00 1.0 [RandomName0]
2 2 2018-10-15 13:40:00 2018-10-15 13:45:00 2.0 [RandomName0, RandomName1]
3 3 2018-10-15 13:45:00 2018-10-15 13:50:00 2.0 [RandomName0, RandomName1]
4 4 2018-10-15 13:50:00 2018-10-15 13:55:00 2.0 [RandomName1, RandomName2]
5 5 2018-10-15 13:55:00 2018-10-15 14:00:00 2.0 [RandomName1, RandomName2]
6 6 2018-10-15 14:00:00 2018-10-15 14:05:00 0.0 NaN
7 7 2018-10-15 14:05:00 2018-10-15 14:10:00 0.0 NaN
8 8 2018-10-15 14:10:00 2018-10-15 14:15:00 1.0 [RandomName3]
9 9 2018-10-15 14:15:00 2018-10-15 14:20:00 1.0 [RandomName3]
10 10 2018-10-15 14:20:00 2018-10-15 14:25:00 1.0 [RandomName4]
11 11 2018-10-15 14:25:00 2018-10-15 14:30:00 1.0 [RandomName4]
12 12 2018-10-15 14:30:00 2018-10-15 14:35:00 1.0 [RandomName5]
13 13 2018-10-15 14:35:00 2018-10-15 14:35:00 1.0 [RandomName5]
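The code that produced this table does not survive in the thread; a sketch that yields the same shape of output (it uses date_range expansion plus reindex rather than the answer's per-group resample, so treat it as an approximation; the Task_Names/Tasks_Running column names are taken from the table above):

```python
import pandas as pd

# Two illustrative tasks with a gap between them.
df = pd.DataFrame({
    "Task": ["RandomName0", "RandomName3"],
    "Start": pd.to_datetime(["2018-10-15 13:30", "2018-10-15 14:10"]),
    "Finish": pd.to_datetime(["2018-10-15 13:41", "2018-10-15 14:11"]),
})

# Expand each task into every 5-minute slot it touches.
rows = [(slot, t.Task)
        for t in df.itertuples()
        for slot in pd.date_range(t.Start.floor("5min"), t.Finish, freq="5min")]
expanded = pd.DataFrame(rows, columns=["Start_Time", "Task"])

# Aggregate names and counts per slot, then reindex onto the full grid
# so the empty 13:45-14:05 slots show up with a zero count.
grid = pd.date_range(df["Start"].min().floor("5min"),
                     df["Finish"].max(), freq="5min")
out = (expanded.groupby("Start_Time")["Task"]
               .agg(Task_Names=list, Tasks_Running="count")
               .reindex(grid))
out["Tasks_Running"] = out["Tasks_Running"].fillna(0)
print(out)
```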
Output of the read_clipboard approach above (the first bin's left edge is rendered as 13:29:59.999999999, an artifact of how pandas builds the bins):
            Start                        Finish  Task Running
0 2018-10-15 13:29:59.999999999 2018-10-15 13:35:00 2
1 2018-10-15 13:35:00.000000000 2018-10-15 13:40:00 2
2 2018-10-15 13:40:00.000000000 2018-10-15 13:45:00 1
3 2018-10-15 13:45:00.000000000 2018-10-15 13:50:00 2
4 2018-10-15 13:50:00.000000000 2018-10-15 13:55:00 0
5 2018-10-15 13:55:00.000000000 2018-10-15 14:00:00 0
6 2018-10-15 14:00:00.000000000 2018-10-15 14:05:00 0
7 2018-10-15 14:05:00.000000000 2018-10-15 14:10:00 1
8 2018-10-15 14:10:00.000000000 2018-10-15 14:15:00 0
9 2018-10-15 14:15:00.000000000 2018-10-15 14:20:00 1
10 2018-10-15 14:20:00.000000000 2018-10-15 14:25:00 0
11 2018-10-15 14:25:00.000000000 2018-10-15 14:30:00 1