Python pandas: filter a dataframe by time intervals from another dataframe


If I have a dataframe (df_data) like this:

I want to filter it with another dataframe of intervals (df_intervals), like this:
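
Roughly, df_data holds per-sensor readings with a Time column, and df_intervals holds start/end pairs. A minimal sketch of the assumed shapes, with the column names taken from the answers below and the values made up purely for illustration:

import pandas as pd

# Hypothetical miniature versions of the two frames, using the column names
# from the answers below (ID/Time/X/Y/Z/H and int_id/start/end); the values
# are made up purely for illustration.
df_data = pd.DataFrame({
    'ID':   ['01', '01', '02', '02'],
    'Time': pd.to_datetime(['2020-02-03 18:13:16', '2020-02-03 18:13:21',
                            '2020-02-03 18:13:16', '2020-02-03 18:13:21']),
    'X': [0.011, 0.015, 0.021, 0.025],
    'Y': [0.012, 0.016, 0.022, 0.026],
    'Z': [0.013, 0.017, 0.023, 0.027],
    'H': [0.014, 0.018, 0.024, 0.028],
})

df_intervals = pd.DataFrame({
    'int_id': [1, 2],
    'start':  pd.to_datetime(['2020-02-03 18:11:59', '2020-02-03 19:36:59']),
    'end':    pd.to_datetime(['2020-02-03 18:42:00', '2020-02-03 20:06:59']),
})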

What is the most efficient way to do this? I have a very large dataset, and if I try to iterate over it like this:

for i in range(len(intervals)):
    df_filtered = df[df['Time'].between(intervals['start'][i], intervals['end'][i])]
    ...
    ...
    ...
it takes forever! I know I shouldn't iterate over a large dataframe, but I don't know how to filter it by each interval of the second dataframe.

The steps I'm trying to follow are:

1 - Get all the intervals (start/end columns) from df_intervals

2 - Use those time intervals to create a new dataframe (df_stats) containing the statistics of the columns within those time ranges. For example:

      start                  end             ID    X_max    X_min    X_mean    Y_max    Y_min    Y_mean    ....
2020-02-03 18:11:59   2020-02-03 18:42:00    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    03    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    04    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 18:11:59   2020-02-03 18:42:00    05    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    01    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    02    ...    ...    ...     ...   ...    ...    ...     ...
2020-02-03 19:36:59   2020-02-03 20:06:59    03    ...    ...    ...     ...   ...    ...    ...     ...

Here is the full code to do this. I created some sample data to check whether it works; please run it against the full dataset and see whether it gives the desired result.

  • Step 1: Create a temporary list to store the temporary dataframes

    temp_list = []

  • Step 2: Iterate over dataframe 2. For each row, do the following:

    • Filter dataframe 1 for the rows between the start and end dates

      temp = df1[df1.Time.between(row.start, row.end)]

    • Group by ID and get the statistics for X, Y, Z and H, one column at a time

      x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()

    • Merge all of the X, Y, Z and H frames into a single dataframe

    • Add the start and end dates to the merged dataframe

    • Append the dataframe to the temporary list

  • Step 3: Create the final dataframe from the temporary list

  • Step 4: Rearrange the columns as needed: start and end dates as the first two columns, then ID, then the X values, Y values, Z values and finally the H values

  • Step 5: Print the dataframe

  • Full code to do all of this:

    c1 = ['ID','Time','X','Y','Z','H']
    d1 = [
    ['01','2020-02-03 18:13:16',0.011,0.012,0.013,0.014],
    ['01','2020-02-03 18:13:21',0.015,0.016,0.017,0.018],
    ['01','2020-02-03 18:13:26',0.013,0.013,0.013,0.013],
    ['01','2020-02-03 18:13:31',0.015,0.015,0.015,0.015],
         
    ['02','2020-02-03 18:13:16',0.021,0.022,0.023,0.024],
    ['02','2020-02-03 18:13:21',0.025,0.026,0.027,0.028],
    ['02','2020-02-03 18:13:26',0.023,0.023,0.023,0.023],
    ['02','2020-02-03 18:13:31',0.025,0.025,0.025,0.025],
         
    ['03','2020-02-03 18:13:16',0.031,0.032,0.033,0.034],
    ['03','2020-02-03 18:13:21',0.035,0.036,0.037,0.038],
    ['03','2020-02-03 18:13:26',0.033,0.033,0.033,0.033],
    ['03','2020-02-03 18:13:31',0.035,0.035,0.035,0.035],
    
    ['04','2020-02-03 18:13:16',0.041,0.042,0.043,0.044],
    ['04','2020-02-03 18:13:21',0.045,0.046,0.047,0.048],
    ['04','2020-02-03 18:13:26',0.043,0.043,0.043,0.043],
    ['04','2020-02-03 18:13:31',0.045,0.045,0.045,0.045],
         
    ['05','2020-02-03 18:13:16',0.055,0.047,0.039,0.062],
    ['05','2020-02-03 18:13:21',0.063,0.063,0.055,0.079],
    ['05','2020-02-03 18:13:26',0.063,0.063,0.063,0.079],
    ['05','2020-02-03 18:13:31',0.095,0.102,0.079,0.127],
         
    ['01','2020-02-03 20:03:16',0.011,0.012,0.013,0.014],
    ['01','2020-02-03 20:03:21',0.015,0.016,0.017,0.018],
    ['01','2020-02-03 20:03:26',0.013,0.013,0.013,0.013],
    ['01','2020-02-03 20:03:31',0.015,0.015,0.015,0.015],
         
    ['02','2020-02-03 20:03:16',0.021,0.022,0.023,0.024],
    ['02','2020-02-03 20:03:21',0.025,0.026,0.027,0.028],
    ['02','2020-02-03 20:03:26',0.023,0.023,0.023,0.023],
    ['02','2020-02-03 20:03:31',0.025,0.025,0.025,0.025],
         
    ['03','2020-02-03 20:03:16',0.031,0.032,0.033,0.034],
    ['03','2020-02-03 20:03:21',0.035,0.036,0.037,0.038],
    ['03','2020-02-03 20:03:26',0.033,0.033,0.033,0.033],
    ['03','2020-02-03 20:03:31',0.035,0.035,0.035,0.035],
    
    ['04','2020-02-03 20:03:16',0.041,0.042,0.043,0.044],
    ['04','2020-02-03 20:03:21',0.045,0.046,0.047,0.048],
    ['04','2020-02-03 20:03:26',0.043,0.043,0.043,0.043],
    ['04','2020-02-03 20:03:31',0.045,0.045,0.045,0.045],
         
    ['05','2020-02-03 20:03:16',0.055,0.047,0.039,0.062],
    ['05','2020-02-03 20:03:21',0.063,0.063,0.055,0.079],
    ['05','2020-02-03 20:03:26',0.063,0.063,0.063,0.079],
    ['05','2020-02-03 20:03:31',0.095,0.102,0.079,0.127],
         
    ['01','2020-07-01 08:59:43',0.063,0.063,0.047,0.079],
    ['01','2020-07-01 08:59:48',0.055,0.055,0.055,0.079],
    ['01','2020-07-01 08:59:53',0.071,0.063,0.055,0.082],
    ['01','2020-07-01 08:59:58',0.063,0.063,0.047,0.082],
    ['01','2020-07-01 08:59:59',0.047,0.047,0.047,0.071]]
    
    import pandas as pd
    df1 = pd.DataFrame(d1,columns=c1)
    df1.Time = pd.to_datetime(df1.Time)
    
    c2 = ['int_id','start','end']
    d2 = [[1,'2020-02-03 18:11:59','2020-02-03 18:42:00'],
    [2,'2020-02-03 19:36:59','2020-02-03 20:06:59'],
    [3,'2020-02-03 21:00:59','2020-02-03 21:31:00'],
    [4,'2020-02-03 22:38:00','2020-02-03 23:08:00'],
    [5,'2020-02-04 05:55:00','2020-02-04 06:24:59'],
    [1804,'2021-01-10 13:50:00','2021-01-10 14:20:00'],
    [1805,'2021-01-10 18:10:00','2021-01-10 18:40:00'],
    [1806,'2021-01-10 19:40:00','2021-01-10 20:10:00'],
    [1807,'2021-01-10 21:25:00','2021-01-10 21:55:00'],
    [1808,'2021-01-10 22:53:00','2021-01-10 23:23:00']]
    
    import pandas as pd
    from functools import reduce
    
    df2 = pd.DataFrame(d2,columns=c2)
    
    df2.start = pd.to_datetime(df2.start)
    df2.end = pd.to_datetime(df2.end)
    
    temp_list = []
    
    for i, row in df2.iterrows():
    
        temp = df1[df1.Time.between(row.start,row.end)]
    
        x = temp.groupby('ID')['X'].agg(['min','max','mean','median']).add_prefix('X_').reset_index()
        y = temp.groupby('ID')['Y'].agg(['min','max','mean','median']).add_prefix('Y_').reset_index()
        z = temp.groupby('ID')['Z'].agg(['min','max','mean','median']).add_prefix('Z_').reset_index()
        h = temp.groupby('ID')['H'].agg(['min','max','mean','median']).add_prefix('H_').reset_index()
    
        data_frames = [x,y,z,h]
    
        df_merged = reduce(lambda left,right: pd.merge(left,right,on=['ID'],
                                how='outer'), data_frames).fillna('void')
    
        df_merged['start'] = row.start
        df_merged['end'] = row.end
        
        temp_list.append(df_merged)
    
    
    df_final = pd.concat(temp_list, ignore_index=True)
    
    column_names = ['start','end','ID',
                        'X_min','X_max','X_mean','X_median',
                        'Y_min','Y_max','Y_mean','Y_median',
                        'Z_min','Z_max','Z_mean','Z_median',
                        'H_min','H_max','H_mean','H_median']
    
    df_final = df_final[column_names]
    
    print (df_final)
    
    The output will be:

                    start                 end  ID  ...  H_max   H_mean  H_median
    0 2020-02-03 18:11:59 2020-02-03 18:42:00  01  ...  0.018  0.01500    0.0145
    1 2020-02-03 18:11:59 2020-02-03 18:42:00  02  ...  0.028  0.02500    0.0245
    2 2020-02-03 18:11:59 2020-02-03 18:42:00  03  ...  0.038  0.03500    0.0345
    3 2020-02-03 18:11:59 2020-02-03 18:42:00  04  ...  0.048  0.04500    0.0445
    4 2020-02-03 18:11:59 2020-02-03 18:42:00  05  ...  0.127  0.08675    0.0790
    5 2020-02-03 19:36:59 2020-02-03 20:06:59  01  ...  0.018  0.01500    0.0145
    6 2020-02-03 19:36:59 2020-02-03 20:06:59  02  ...  0.028  0.02500    0.0245
    7 2020-02-03 19:36:59 2020-02-03 20:06:59  03  ...  0.038  0.03500    0.0345
    8 2020-02-03 19:36:59 2020-02-03 20:06:59  04  ...  0.048  0.04500    0.0445
    9 2020-02-03 19:36:59 2020-02-03 20:06:59  05  ...  0.127  0.08675    0.0790
    

    If Joe's answer doesn't give you the speed you want, I think it can be sped up by taking the statistics calculation out of the for loop. (I stole his df creation, since he was the hero who included that in his answer.) Ideally you would also get rid of the for loop, but since the timestamp index is repeated (across ID numbers), merging the two dataframes could be tricky.

    Here is my attempt, which still uses iteration to handle the start/end times. First, I apply int_id to the parent df. I add it to the parent dataframe so that I can groupby without having to create a "temp" dataframe and run the statistics on it.

    for index, row in df2.iterrows():
        
        df1.loc[df1.Time.between(row.start,row.end), 'int_id'] = row.int_id
    
        ID                Time      X      Y      Z      H  int_id
    0   01 2020-02-03 18:13:16  0.011  0.012  0.013  0.014     1.0
    1   01 2020-02-03 18:13:21  0.015  0.016  0.017  0.018     1.0
    2   01 2020-02-03 18:13:26  0.013  0.013  0.013  0.013     1.0
    3   01 2020-02-03 18:13:31  0.015  0.015  0.015  0.015     1.0
    4   02 2020-02-03 18:13:16  0.021  0.022  0.023  0.024     1.0
    5   02 2020-02-03 18:13:21  0.025  0.026  0.027  0.028     1.0
    6   02 2020-02-03 18:13:26  0.023  0.023  0.023  0.023     1.0
    
    Then I define the aggregations so that everything can be done in one shot once the loop has finished.

    aggs = {'X':['sum', 'max', 'mean', 'median'], 
            'Y':['sum', 'max', 'mean', 'median'], 
            'Z':['sum', 'max', 'mean', 'median'], 
            'H':['sum', 'max', 'mean', 'median']}
    
    df_final = df1.groupby('int_id').agg(aggs)
    
                X                            Y                             Z                            H                        
              sum    max    mean median    sum    max     mean median    sum    max    mean median    sum    max     mean  median
    int_id                                                                                                                       
    1.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345
    2.0     0.732  0.095  0.0366  0.034  0.739  0.102  0.03695  0.034  0.708  0.079  0.0354  0.034  0.827  0.127  0.04135  0.0345
    
    Note: this leaves a MultiIndex on the columns. You can join the levels with the following:

    df_final.columns = ['_'.join(col).strip() for col in df_final.columns.values]
    
            X_sum  X_max  X_mean  X_median  Y_sum  Y_max   Y_mean  Y_median  Z_sum  Z_max  Z_mean  Z_median  H_sum  H_max   H_mean  H_median
    int_id                                                                                                                                  
    1.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
    2.0     0.732  0.095  0.0366     0.034  0.739  0.102  0.03695     0.034  0.708  0.079  0.0354     0.034  0.827  0.127  0.04135    0.0345
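
    As noted above, ideally the loop over df2 would disappear entirely. One possible way to do that (a minimal sketch, not taken from either answer, and assuming the intervals in df2 never overlap, which pd.cut requires) is to build a pd.IntervalIndex from the start/end columns and label every row of df1 in a single vectorised pass:

    import numpy as np
    import pandas as pd

    # Non-overlapping bins from df2's start/end columns; closed='both' mirrors
    # the inclusive behaviour of Series.between used in the loops above.
    bins = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')

    # One vectorised pass over all timestamps: the position of the interval
    # each Time falls into, or -1 when it lies outside every interval.
    codes = pd.cut(df1['Time'], bins).cat.codes.to_numpy()

    # Translate interval positions into df2's int_id labels; rows outside all
    # intervals become NaN and are dropped before aggregating.
    df1['int_id'] = np.where(codes >= 0, df2['int_id'].to_numpy()[codes], np.nan)

    df_stats = (df1.dropna(subset=['int_id'])
                   .groupby(['int_id', 'ID'])[['X', 'Y', 'Z', 'H']]
                   .agg(['min', 'max', 'mean', 'median']))
    df_stats.columns = ['_'.join(col) for col in df_stats.columns]
    print(df_stats.reset_index())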
    

    First, let me label the dataframes: the big dataframe will be df1, and the filter-intervals dataframe will be df2. Take the first row of df2 as an example; you have 2020-02-03 18:11:59 and 2020-02-03 18:42:00. What do you want to do with these two values? Use them as a filter, take all the matching rows from df1, and run the statistical operations on those? Can you explain the details so we can understand what you want to do? Step 2: once you have filtered and obtained the values for each row, where do you want the output stored? In df1 or df2? It can't be df1, because that would duplicate values across rows and columns. If it has to be stored in df1, should it be a new column? Then you would end up with 1808 new columns times the number of statistics columns you want; in the example you have mean, median, max, min, etc. (at least 4) = 1808 x 4. If it is df2, then it is 4 (X, Y, Z, H) times the statistics (at least 4). So could you clarify in your question what you expect the solution to look like?

    Hi @JoeFerndz! Thanks for your reply! With df1 as my data dataframe (containing data from 5 different sensors) and df2 as my "filter database": 1 - take the first row of df2 (interval id, start, end); 2 - use "start" and "end" to filter df1 (which has data for 5 different IDs); 3 - compute statistics on the filtered df1 (let's call it df_filtered) and store the results in df_stats (int_id, sensor id, max, min, mean); 4 - since I have 5 sensors and 1808 intervals, my final df_stats should have 5 x 1808 rows (this example only shows some of the statistics).

    Please add those details to the question itself, so that everyone can clearly understand what you are trying to do here.

    So the overall picture is: I have a big dataframe of data recorded by 5 sensors, and I need to filter it using the time intervals listed in a second df and compute statistics on the data inside those ranges... My final df should contain only the statistics columns (max, mean, min, median, etc.), referenced by interval ID and sensor ID, so I can locate the event I am analysing.

    Thanks for the post. Upvoted. Nice to see the improvement; agreed.

    Thank you both for taking the time to help me! I ran some tests on this scenario: my df1 has 17,174,122 rows and my df2 has 1,786 rows... Iterating over them as you both suggested took 7 minutes and 20 seconds! That is somewhat better than what I am currently doing, but I wonder whether there is a more efficient approach, or whether this is the best I can get given how large my database is. Thanks again for your help.

    Before trying to speed this up, it would be useful to figure out which part is the slowest. I would suggest splitting your code into four functions: one that loads the giant dataframe, one that does the iterrows() pass that adds the group labels to the big dataframe, one that takes the labelled dataframe and aggregates it, and a last one that saves the output. If you run cProfile on those functions you can see which one is the slowest.

    @CorreyKoshnick Sorry, I wasn't clear in my answer!
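
    Following up on the profiling suggestion in the comments, here is a minimal sketch (assuming df1 and df2 from the answers above are already in memory) that wraps the suspect stage in a function and profiles it:

    import cProfile
    import pstats

    def label_intervals():
        # the iterrows() labelling pass from the answer above
        for index, row in df2.iterrows():
            df1.loc[df1.Time.between(row.start, row.end), 'int_id'] = row.int_id

    profiler = cProfile.Profile()
    profiler.enable()
    label_intervals()
    profiler.disable()

    # show the ten most expensive calls, sorted by cumulative time
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)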