Python Pandas: counting tasks that span multiple X-minute time segments


Consider the following data:

Index   Task        Start                       Finish
0       RandomName  2018-10-15T13:30:00+00:00   2018-10-15T13:41:00+00:00
1       RandomName  2018-10-15T13:40:00+00:00   2018-10-15T13:51:00+00:00
2       RandomName  2018-10-15T13:50:00+00:00   2018-10-15T13:51:00+00:00
3       RandomName  2018-10-15T14:10:00+00:00   2018-10-15T14:11:00+00:00
4       RandomName  2018-10-15T14:20:00+00:00   2018-10-15T14:21:00+00:00
5       RandomName  2018-10-15T14:30:00+00:00   2018-10-15T14:31:00+00:00
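What I am trying to do is split this DataFrame into 5-minute segments (time slots), count how many tasks are running in each segment, and visualize the result. Because the tasks have a duration, I first have to generate the segments, which I do as follows: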

import pandas as pd
from datetime import timedelta

def main():
    input_file = "input.csv"
    df = pd.read_csv(
        input_file,
        parse_dates=['Start', 'Finish'],
        names=['Index', 'Job', 'Start', 'Finish'],
        index_col='Index',
        header=None,
    )

    # Find the duration of each task in minutes.
    df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
    df['Finish'] = pd.to_datetime(df['Finish'], errors='coerce')
    # Subtract the full timestamps; comparing only .dt.minute breaks across hour boundaries.
    df['Duration'] = (df['Finish'] - df['Start']).dt.total_seconds() / 60

    # Define the range and split it into 5 minute segments
    rng_min = df['Start'].min()   # Earliest date
    rng_max = df['Finish'].max()  # Latest date
    current = rng_min
    while current < rng_max:
        current += timedelta(minutes=5)

if __name__ == "__main__":
    main()
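You can do what you need by using resample followed by reindex on a time-series index:

Resampling lets you change the frequency of a datetime index. In this case you want to upsample, that is, increase the number of steps in your data. Reindexing then lets you fill the gaps with NAs.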
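The answer's original code is not preserved here, so the following is only a minimal sketch of what upsampling with `resample` followed by `reindex` can look like. The series `events`, the variable names, and the grid boundaries are assumptions chosen for illustration; the timestamps are the task start times from the question.

```python
import pandas as pd

# Hypothetical series: one value per task start time from the question.
starts = pd.to_datetime([
    "2018-10-15 13:30", "2018-10-15 13:40", "2018-10-15 13:50",
    "2018-10-15 14:10", "2018-10-15 14:20", "2018-10-15 14:30",
])
events = pd.Series(1.0, index=starts)

# Upsample: move to a regular 5-minute index. asfreq() keeps the
# existing values and leaves the newly created steps as NaN.
upsampled = events.resample("5min").asfreq()

# Reindex: align onto an explicit 5-minute grid (assumed boundaries).
# Slots that are missing from `upsampled` are filled with NaN.
grid = pd.date_range("2018-10-15 13:25", "2018-10-15 14:40", freq="5min")
aligned = upsampled.reindex(grid)

print(aligned)
```

Passing `fill_value=0` to `reindex` would give zeros instead of NaN for the slots it adds.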


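A useful way to split the data is to use the date_range function to give each row a column of segment times at 5-minute frequency. Expand this column into a new DataFrame using itertuples(), which iterates over each row of the DataFrame. From there you can run a groupby on the data, or change it however you need.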

    # Assumes df holds the question's data with 'Task', 'Start' and 'Finish' columns.
    df['Start'] = pd.to_datetime(df['Start'])
    df['Finish'] = pd.to_datetime(df['Finish'])
    # One DatetimeIndex of 5-minute steps per task, from its start to its finish.
    df['Segments'] = df.index.map(lambda x: pd.date_range(start=df.loc[x, 'Start'], end=df.loc[x, 'Finish'], freq='5min'))
    # Expand to one (Time, Task) row per 5-minute step.
    df1 = pd.DataFrame([(d, t.Task) for t in df.itertuples() for d in t.Segments], columns=['Time', 'Task'])
    grouped = df1.groupby('Time')
    for time, group in grouped:
        print(group)
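As one possible way to continue from the grouped data, the sketch below counts the tasks per timestamp and fills the empty 5-minute slots with zero. It is not part of the original answer: the task names `T0` to `T5` are hypothetical stand-ins for the question's `RandomName`, and the rows are built with `zip` rather than an intermediate `Segments` column.

```python
import pandas as pd

# Sample data from the question; task names T0..T5 are hypothetical.
df = pd.DataFrame({
    "Task": ["T0", "T1", "T2", "T3", "T4", "T5"],
    "Start": pd.to_datetime([
        "2018-10-15 13:30", "2018-10-15 13:40", "2018-10-15 13:50",
        "2018-10-15 14:10", "2018-10-15 14:20", "2018-10-15 14:30",
    ]),
    "Finish": pd.to_datetime([
        "2018-10-15 13:41", "2018-10-15 13:51", "2018-10-15 13:51",
        "2018-10-15 14:11", "2018-10-15 14:21", "2018-10-15 14:31",
    ]),
})

# One (Time, Task) row per 5-minute step between Start and Finish.
rows = [
    (stamp, task)
    for task, start, finish in zip(df["Task"], df["Start"], df["Finish"])
    for stamp in pd.date_range(start, finish, freq="5min")
]
df1 = pd.DataFrame(rows, columns=["Time", "Task"])

# Count the tasks seen at each timestamp.
counts = df1.groupby("Time")["Task"].count()

# Add the 5-minute slots in which no task was seen, with a count of 0.
grid = pd.date_range(df["Start"].min(), df["Finish"].max(), freq="5min")
counts = counts.reindex(grid, fill_value=0)

print(counts)
```

Note that this counts a task only at the 5-minute steps generated from its own start time, so a task is not counted in a slot it merely overlaps without hitting one of those steps.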

You could try something like this:

import pandas as pd

# Copy the sample data from the question to the clipboard first, then read it.
df = pd.read_clipboard(index_col='Index')

df[['Start', 'Finish']] = df[['Start', 'Finish']].apply(pd.to_datetime)

# Expand each task into 5-minute timestamps, then bin them into 5-minute intervals.
# ('5min' replaces the older '5T' alias, which newer pandas versions deprecate.)
df_out = df.apply(lambda x: pd.Series(pd.date_range(x.Start, x.Finish, freq='5min')), axis=1)\
  .stack()\
  .value_counts(bins=pd.date_range(df.Start.min(), df.Finish.max(), freq='5min'))\
  .sort_index()

# Split the interval index into separate start and end columns.
df_out.index = pd.MultiIndex.from_tuples(df_out.index.to_tuples())

df_out = df_out.rename_axis(['Start', 'Finish']).rename('Task Running').reset_index()
print(df_out)

df_out.plot('Start', 'Task Running')
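Output (note: it is ambiguous whether the interval boundaries are inclusive at the start or at the end, i.e. whether a value at 13:35 should be counted at the end of one interval or at the start of the next):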

Visualized output: [plot of 'Task Running' against 'Start']
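On the boundary question noted above, the following small sketch shows how the choice of closed side decides which interval a timestamp on an edge belongs to. It uses `pd.IntervalIndex.from_breaks`; the edges and variable names are assumptions for illustration and are not taken from the answer.

```python
import pandas as pd

# Three edges give two 5-minute intervals: 13:30-13:35 and 13:35-13:40.
edges = pd.date_range("2018-10-15 13:30", periods=3, freq="5min")
stamp = pd.Timestamp("2018-10-15 13:35")  # sits exactly on an edge

# Right-closed intervals (a, b]: the edge belongs to the interval it ends.
right_closed = pd.IntervalIndex.from_breaks(edges, closed="right")

# Left-closed intervals [a, b): the edge belongs to the interval it starts.
left_closed = pd.IntervalIndex.from_breaks(edges, closed="left")

print(right_closed.get_loc(stamp))  # position of the interval ending at 13:35
print(left_closed.get_loc(stamp))   # position of the interval starting at 13:35
```

With right-closed intervals, 13:35 falls into the first interval (position 0); with left-closed intervals it falls into the second (position 1). Whichever convention is chosen should be applied consistently when counting tasks.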


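What does your expected output look like? Please add it to the post. Thanks!

It can be done as follows: (a) Create a dictionary whose keys are the start time, start time + 5 minutes, start time + 10 minutes, and so on until the start time of the last record is covered. (b) Compare each task's finish time with each key. If it is greater than the key, append the task to that key's list. You will then have {'start_window1': ['T1', 'T2'], 'start_window2': ['T1', 'T2'], ...} and so on, where T1, T2 are task names. (c) Compute the length of the list for each key to get the desired answer.

Hi Katelie, thank you! I tried your code, but there is one error: the segment duration changes from 5 minutes to 10 minutes (this also happens in the example in your reply).

Sorry, you need to use resample and then reindex to create all of the 5-minute intervals. I have edited my answer to correct it.

Thank you, Katelie, you have no idea how much your help is appreciated! Unfortunately I still get an error when I reach the resample part: ValueError: cannot reindex from a duplicate axis [line 40], which is `.apply(lambda x: x.resample(rule='{interval}T'.format(interval=minutes_per_segment), label='right', closed='right')`. Can you help me? Is it because the task names are not unique?

Hmm... do you have rows where the start time and the end time are the same (specifically the minutes)? It works for me with a combination of non-unique task names and straight duplicate rows, but not when end time = start time. If you need to take seconds into account, you may need to change the `…` bit in the timedelta calculation.

Try it again with the sample data if you like. I realized that the file read used "Job" rather than "Task" as in the example, and that the task name creation was lost when I edited. I also changed "header=None" to "header=0" to account for the column labels in the sample data (otherwise it would not read for me), but that may differ from the real data. I guess the column names could be odd? That ValueError is probably an index or column problem.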
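The dictionary approach from the comment above, steps (a) to (c), might be sketched in plain Python as follows. The task names `T0` to `T5` and the variable names are hypothetical. The comment only compares the finish time with each key; the extra check that the task has started before the window ends is an assumption added here so that tasks are not counted in windows before they begin.

```python
from datetime import datetime, timedelta

# Sample data from the question; task names T0..T5 are hypothetical.
tasks = {
    "T0": (datetime(2018, 10, 15, 13, 30), datetime(2018, 10, 15, 13, 41)),
    "T1": (datetime(2018, 10, 15, 13, 40), datetime(2018, 10, 15, 13, 51)),
    "T2": (datetime(2018, 10, 15, 13, 50), datetime(2018, 10, 15, 13, 51)),
    "T3": (datetime(2018, 10, 15, 14, 10), datetime(2018, 10, 15, 14, 11)),
    "T4": (datetime(2018, 10, 15, 14, 20), datetime(2018, 10, 15, 14, 21)),
    "T5": (datetime(2018, 10, 15, 14, 30), datetime(2018, 10, 15, 14, 31)),
}
window = timedelta(minutes=5)

# (a) One key per 5-minute window start, up to the last task's start time.
first_start = min(start for start, _ in tasks.values())
last_start = max(start for start, _ in tasks.values())
running = {}
key = first_start
while key <= last_start:
    running[key] = []
    key += window

# (b) Append each task to every window it overlaps.
# Assumption: windows are half-open, [key, key + window).
for name, (start, finish) in tasks.items():
    for key in running:
        if finish > key and start < key + window:
            running[key].append(name)

# (c) The count per window is the length of its list.
counts = {key: len(names) for key, names in running.items()}

for key, names in running.items():
    print(key, counts[key], names)
```

For the sample data this gives 13 windows from 13:30 to 14:30, with two tasks running in the 13:40 and 13:50 windows.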
Output from the resample/reindex answer (5-minute timeslots):

    Timeslot          Start_Time            End_Time  Tasks_Running                    Task_Names 
0          0 2018-10-15 13:30:00 2018-10-15 13:35:00            1.0                 [RandomName0] 
1          1 2018-10-15 13:35:00 2018-10-15 13:40:00            1.0                 [RandomName0] 
2          2 2018-10-15 13:40:00 2018-10-15 13:45:00            2.0    [RandomName0, RandomName1] 
3          3 2018-10-15 13:45:00 2018-10-15 13:50:00            2.0    [RandomName0, RandomName1] 
4          4 2018-10-15 13:50:00 2018-10-15 13:55:00            2.0    [RandomName1, RandomName2] 
5          5 2018-10-15 13:55:00 2018-10-15 14:00:00            2.0    [RandomName1, RandomName2] 
6          6 2018-10-15 14:00:00 2018-10-15 14:05:00            0.0                           NaN 
7          7 2018-10-15 14:05:00 2018-10-15 14:10:00            0.0                           NaN 
8          8 2018-10-15 14:10:00 2018-10-15 14:15:00            1.0                 [RandomName3] 
9          9 2018-10-15 14:15:00 2018-10-15 14:20:00            1.0                 [RandomName3] 
10        10 2018-10-15 14:20:00 2018-10-15 14:25:00            1.0                 [RandomName4] 
11        11 2018-10-15 14:25:00 2018-10-15 14:30:00            1.0                 [RandomName4] 
12        12 2018-10-15 14:30:00 2018-10-15 14:35:00            1.0                 [RandomName5] 
13        13 2018-10-15 14:35:00 2018-10-15 14:35:00            1.0                 [RandomName5] 
Output printed by the value_counts approach:

                           Start              Finish  Task Running
0  2018-10-15 13:29:59.999999999 2018-10-15 13:35:00             2
1  2018-10-15 13:35:00.000000000 2018-10-15 13:40:00             2
2  2018-10-15 13:40:00.000000000 2018-10-15 13:45:00             1
3  2018-10-15 13:45:00.000000000 2018-10-15 13:50:00             2
4  2018-10-15 13:50:00.000000000 2018-10-15 13:55:00             0
5  2018-10-15 13:55:00.000000000 2018-10-15 14:00:00             0
6  2018-10-15 14:00:00.000000000 2018-10-15 14:05:00             0
7  2018-10-15 14:05:00.000000000 2018-10-15 14:10:00             1
8  2018-10-15 14:10:00.000000000 2018-10-15 14:15:00             0
9  2018-10-15 14:15:00.000000000 2018-10-15 14:20:00             1
10 2018-10-15 14:20:00.000000000 2018-10-15 14:25:00             0
11 2018-10-15 14:25:00.000000000 2018-10-15 14:30:00             1