Python: adjusting a subset of timestamps in a df (python, pandas, sorting)


I am trying to create a new df by calculating on timestamps. Specifically, for the df below, I first return all rows where the integer in Number differs from that of the previous row.

Then I want to adjust those timestamps according to the following two rules:

  • If the integer in Number increases, round the timestamp back to the previous 15-minute mark
  • If the integer in Number decreases, keep the current timestamp

I am not sure this is the most efficient approach, but I am currently doing it by subsetting two separate dataframes and then merging: I return all the increasing numbers with modified timestamps, and all the decreasing numbers unchanged. The trouble starts when I merge the two back together:

    If the gaps between the integer changes are small, the rounding can put the sequence out of order. Essentially, Number comes out wrong whenever an increase falls within 15 minutes of a decrease: because the increase is rounded back, its resulting timestamp ends up in the wrong place.
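The rounding rule above can be sketched in isolation. This is a minimal sketch, not the full solution: it assumes the 11-minute back-shift used in the code below (so a timestamp up to roughly 10 minutes past a 15-minute mark snaps back to that mark), and `round_increase` is a hypothetical helper name.

```python
import pandas as pd

def round_increase(ts: pd.Timestamp) -> pd.Timestamp:
    # Shift back 11 minutes, then floor to the 15-minute grid, so a
    # timestamp up to ~10 minutes past a mark snaps back to that mark.
    return (ts - pd.Timedelta(minutes=11)).floor('15min')

# An increase at 9:59 snaps back to the 9:45 mark:
print(round_increase(pd.Timestamp('1900-01-01 09:59:00')))  # 1900-01-01 09:45:00
# A decrease keeps its timestamp unchanged, so no helper is needed for it.
```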

    df = pd.DataFrame({
        'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:10:00','1/1/1900 12:21:00','1/1/1900 12:26:00','1/1/1900 13:00:00','1/1/1900 13:26:00','1/1/1900 13:29:00','1/1/1900 14:20:00','1/1/1900 18:10:00'],                 
        'Number' : [1,1,2,2,3,2,1,2,1,1],                      
        })
    
    # First and last entry in df. This ensures the start/end of the subsequent
    # df includes rows where the 'Number' increases/decreases.
    first_time = df.loc[0,'Time']
    last_time = df.loc[df.index[-1], 'Time']
    
    # Insert 0 prior to first race
    df.loc[-1] = [first_time, 0]  
    df.index = df.index + 1  
    df.sort_index(inplace=True) 
    
    # Insert 0 after the last race
    df.loc[len(df)] = last_time, 0
    
    # Convert to datetime. Include new column that rounds all timestamps. If timestamp
    # is within 10mins of nearest 15min, round to that point.
    df['Time'] = pd.to_datetime(df['Time'])
    df['New Time'] = df['Time'].sub(pd.Timedelta(11*60, 's')).dt.floor(freq='15T')
    
    # Create separate df's. Inc contains all increased integers. Dec contains
    # all decreases in integers  
    df = df[df['Number'] != df['Number'].shift()]
    Inc = df[df['Number'] > df['Number'].shift()]
    Dec = df[df['Number'] < df['Number'].shift()]
    
    del Inc['Time']
    del Dec['New Time']
    Inc.columns = ['Number','Time']
    
    # Merge df's
    df1 = pd.concat([Inc,Dec], sort = True)
    
    # Sort so it's time ordered
    df1['Time'] = pd.to_datetime(df1['Time'])
    df1 = df1.sort_values('Time')
    

    Output:

        Number                Time
    1        1 1900-01-01 07:45:00
    3        2 1900-01-01 09:45:00
    5        3 1900-01-01 12:15:00
    6        2 1900-01-01 13:00:00
    8        2 1900-01-01 13:15:00 *Was previously 13:29:00
    7        1 1900-01-01 13:26:00 *To be removed because within 15 of previous row
    9        1 1900-01-01 14:20:00
    11       0 1900-01-01 18:10:00
    
    Expected output:

        Number                Time
    1        1 1900-01-01 07:45:00
    3        2 1900-01-01 09:45:00
    5        3 1900-01-01 12:15:00
    6        2 1900-01-01 13:00:00
    8        2 1900-01-01 13:15:00
    9        1 1900-01-01 14:20:00
    11       0 1900-01-01 18:10:00
    
    
    Edit 2:

    I run into trouble when there are consecutive increases in back-to-back 15-minute periods. It seems to miss the first increase and only returns the second.

    df = pd.DataFrame({
        'Time' : ['1/1/1900 8:00:00','1/1/1900 9:49:00','1/1/1900 10:00:00','1/1/1900 10:13:00','1/1/1900 12:26:00','1/1/1900 13:00:00','1/1/1900 13:22:00','1/1/1900 13:45:00','1/1/1900 14:21:00','1/1/1900 14:36:00'],                 
        'Number' : [1,2,2,2,1,1,2,2,3,4],                      
        })
    
    # if your Time column is not of type datetime64, execute the following line:
    df['Time']= df['Time'].astype('datetime64')
    
    # add some auxiliary columns
    df['row_id']= df.index                                         # needed for the delete indexer, to avoid deleting adjusted rows that are joined with themselves
    df['increase']= df['Number'] > df['Number'].shift(1).fillna(0) # identifies the rows where the value increases; fillna(0) makes sure the first row counts as an increase if it is larger than 0
    df['Adjusted Time']= df['Time'].where(~df['increase'], df['Time'].sub(pd.Timedelta(11*60, 's')).dt.floor('15min')) # the Adjusted Time is the time we want to display later and also forms a range to delete (we delete other records later if they lie between "Adjusted Time" and "Time")
    
    # merge the ranges to identify the rows we need to delete
    get_delete_ranges= df[df['Time'] > df['Adjusted Time']]        # these are the ranges for which we have to check whether something else lies in between
    df_with_del_ranges= pd.merge_asof(df, get_delete_ranges, left_on='Time', right_on='Adjusted Time', tolerance=pd.Timedelta('15m'), suffixes=['', '_del'])
    
    # create an indexer for the rows to delete
    del_row= (df_with_del_ranges['row_id_del'] != df_with_del_ranges['row_id']) & (df_with_del_ranges['Time'] >= df_with_del_ranges['Adjusted Time_del']) & (df_with_del_ranges['Time'] <= df_with_del_ranges['Time_del'])
    
    # delete the rows in the overlapping ranges
    df_with_del_ranges.drop(df_with_del_ranges[del_row].index, axis='index', inplace=True)
    # remove the auxiliary columns and restore the original column names
    df_with_del_ranges.drop([col for col in df_with_del_ranges if col not in ['Number', 'Adjusted Time']], axis='columns', inplace=True)
    df_with_del_ranges.rename({'Adjusted Time': 'Time'}, axis='columns', inplace=True)
    
    Expected output:
    
       Number                Time
    0       1 1900-01-01 07:45:00
    1       2 1900-01-01 09:30:00
    2       2 1900-01-01 10:00:00
    3       2 1900-01-01 10:13:00
    4       1 1900-01-01 12:26:00
    6       2 1900-01-01 13:00:00
    7       2 1900-01-01 13:45:00
    8       3 1900-01-01 14:00:00 #Index 8 in df has an increase at 14:21. Should be rounded up to 14:00 and Number should be 3
    9       4 1900-01-01 14:15:00 
    

    Try the following code:

    # if you want the last time in your dataframe to be zero, just execute the following line (as this is equivalent to adding a new column and deleting the old one):
    df.iloc[-1, 1]= 0
    
    # if your Time column is not of type datetime64, execute the following line:
    df['Time']= df['Time'].astype('datetime64')
    
    # add some auxiliary columns
    df['row_id']= df.index                                         # needed for the delete indexer, to avoid deleting adjusted rows that are joined with themselves
    df['increase']= df['Number'] > df['Number'].shift(1).fillna(0) # identifies the rows where the value increases; fillna(0) makes sure the first row counts as an increase if it is larger than 0
    df['Adjusted Time']= df['Time'].where(~df['increase'], df['Time'].sub(pd.Timedelta(11*60, 's')).dt.floor('15min')) # the Adjusted Time is the time we want to display later and also forms a range to delete (we delete other records later if they lie between "Adjusted Time" and "Time")
    
    # merge the ranges to identify the rows we need to delete
    get_delete_ranges= df[df['Time'] > df['Adjusted Time']]        # these are the ranges for which we have to check whether something else lies in between
    df_with_del_ranges= pd.merge_asof(df, get_delete_ranges, left_on='Time', right_on='Adjusted Time', tolerance=pd.Timedelta('15m'), suffixes=['', '_del'])
    
    # create an indexer for the rows to delete
    del_row= (df_with_del_ranges['row_id_del'] != df_with_del_ranges['row_id']) & (df_with_del_ranges['Time'] >= df_with_del_ranges['Adjusted Time_del']) & (df_with_del_ranges['Time'] <= df_with_del_ranges['Time_del'])
    
    # delete the rows in the overlapping ranges
    df_with_del_ranges.drop(df_with_del_ranges[del_row].index, axis='index', inplace=True)
    # remove the auxiliary columns and restore the original column names
    df_with_del_ranges.drop([col for col in df_with_del_ranges if col not in ['Number', 'Adjusted Time']], axis='columns', inplace=True)
    df_with_del_ranges.rename({'Adjusted Time': 'Time'}, axis='columns', inplace=True)
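To make the merge step above easier to follow, here is a small, self-contained illustration of how pd.merge_asof with a tolerance behaves (direction='backward' is the default); the toy frames are invented for demonstration and are not the post's data.

```python
import pandas as pd

left = pd.DataFrame({'Time': pd.to_datetime(
    ['1900-01-01 13:00', '1900-01-01 13:20', '1900-01-01 13:50'])})
right = pd.DataFrame({'Time': pd.to_datetime(['1900-01-01 13:15']),
                      'flag': ['increase']})

# For each left row, merge_asof takes the last right row whose Time is <=
# the left Time, but only if it lies within the 15-minute tolerance:
#   13:00 -> no match (13:15 is later), 13:20 -> matches 13:15,
#   13:50 -> no match (13:15 is 35 minutes back, beyond the tolerance).
out = pd.merge_asof(left, right, on='Time', tolerance=pd.Timedelta('15min'))
print(out)
```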
    

    Without df.iloc[-1, 1] = 0, the Number column of the last row will contain 1.

    Sorry, fixed. I've included it now @jezrael. Does the question make sense? It's a bit hard to describe.
    Hi, could you add the input values for which you gave the expected output?
    Nice task, I like it :-) Just realized you apply an 11-minute adjustment in your code, and updated my code accordingly.
    Thank you so much @jottbe, this problem bothered me for hours. Just one quick question: I have an edge case where an increase occurring within consecutive 15-minute periods is not returned. It misses the first increase and only returns the second. I've added a new edit to the question showing this.
    OK, I see what is happening. The last record at 14:36 is an increase from 3 to 4, so per the rule 11 minutes are subtracted and the result is floored to 14:15. In that case the rows between 14:15 and 14:36 need to be adjusted (deleted), at least by my interpretation of the adjustment; it seems that interpretation was wrong, but changing it is no big deal. You just need to change the del_row indexer, replacing 'Time' with 'Adjusted Time' on the left-hand side of the inequalities while keeping the '…_del' fields unchanged. You then need one additional aggregation step, because the deletion logic based on 'Adjusted Time' can produce rows with exactly the same 'Adjusted Time' when two rows lie within a small time range, both represent increases, and both round to the same time. I guess you don't want both output lines in that case, right?
    In [131]: df_with_del_ranges
    Out[131]: 
       Number                Time
    0       1 1900-01-01 07:45:00
    2       2 1900-01-01 09:45:00
    4       3 1900-01-01 12:15:00
    5       2 1900-01-01 13:00:00
    7       2 1900-01-01 13:15:00
    8       1 1900-01-01 14:20:00
    9       0 1900-01-01 18:10:00
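Regarding the extra aggregation step mentioned in the last comment: one way to collapse two increases that round to the same adjusted time is to keep a single row per timestamp, for example the later one. This is a sketch under that interpretation, with an invented two-row frame:

```python
import pandas as pd

# Two consecutive increases that both rounded to the same adjusted time:
df = pd.DataFrame({
    'Number': [3, 4],
    'Time': pd.to_datetime(['1900-01-01 14:15:00', '1900-01-01 14:15:00']),
})

# Keep one row per identical adjusted time; keep='last' retains the later
# increase, so a run of back-to-back increases collapses into a single row.
deduped = df.drop_duplicates(subset='Time', keep='last').reset_index(drop=True)
print(deduped)
```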