Python 3.x: join with forward and/or backward fill based on custom conditions


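Every date in df appears in the ref_date column of ref_df, but not the other way around: ref_df also contains dates that never occur in df. For each date in df, I need to pick a ref_date from ref_df according to the following logic: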

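  • If a date is repeated more than once and the previous or next ref_date is missing from df, the repeated rows are assigned, working from the edges of the repeated run, to the nearest missing previous or next ref_date.
  • If a date is repeated more than once but no previous/next ref_date is missing, ref_date is the same as date.
  • Some ref_date values may not end up in df at all. This happens when there are no repeated dates before or after a given ref_date to fill it.
  • Example: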

    >>> import pandas as pd
    >>> from datetime import datetime as dt
    >>> df = pd.DataFrame({'date':[dt(2020,1,20), dt(2020,1,20), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,3,18), dt(2020,4,9), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,4,12), dt(2020,5,28), dt(2020,6,1), dt(2020,6,1), dt(2020,6,1), dt(2020,6,28), dt(2020,6,28)], 'qty':range(18)})
    >>> ref_df = pd.DataFrame({'ref_date':[dt(2019,12,8), dt(2020,1,20), dt(2020,2,25), dt(2020,3,18), dt(2020,4,9), dt(2020,4,10), dt(2020,4,12), dt(2020,4,13), dt(2020,4,14), dt(2020,5,28), dt(2020,5,29), dt(2020,5,30), dt(2020,6,1), dt(2020,6,2), dt(2020,6,3), dt(2020,6,28), dt(2020,6,29), dt(2020,7,7)]})
    >>> df
             date  qty
    0  2020-01-20    0
    1  2020-01-20    1
    2  2020-01-20    2
    3  2020-02-25    3
    4  2020-03-18    4
    5  2020-03-18    5
    6  2020-04-09    6
    7  2020-04-12    7
    8  2020-04-12    8
    9  2020-04-12    9
    10 2020-04-12   10
    11 2020-04-12   11
    12 2020-05-28   12
    13 2020-06-01   13
    14 2020-06-01   14
    15 2020-06-01   15
    16 2020-06-28   16
    17 2020-06-28   17
    >>> ref_df
         ref_date
    0  2019-12-08
    1  2020-01-20
    2  2020-02-25
    3  2020-03-18
    4  2020-04-09
    5  2020-04-10
    6  2020-04-12
    7  2020-04-13
    8  2020-04-14
    9  2020-05-28
    10 2020-05-29
    11 2020-05-30
    12 2020-06-01
    13 2020-06-02
    14 2020-06-03
    15 2020-06-28
    16 2020-06-29
    17 2020-07-07
    
    
    Expected output:

    >>> df
             date  qty    ref_date
    0  2020-01-20    0  2019-12-08
    1  2020-01-20    1  2020-01-20  # Note: repeated as no gap
    2  2020-01-20    2  2020-01-20
    3  2020-02-25    3  2020-02-25
    4  2020-03-18    4  2020-03-18
    5  2020-03-18    5  2020-03-18  # Note: repeated as no gap
    6  2020-04-09    6  2020-04-09
    7  2020-04-12    7  2020-04-10  # Note: Filling from the edges
    8  2020-04-12    8  2020-04-12
    9  2020-04-12    9  2020-04-12  # Note: repeated as not enough gap
    10 2020-04-12   10  2020-04-13
    11 2020-04-12   11  2020-04-14
    12 2020-05-28   12  2020-05-28
    13 2020-06-01   13  2020-05-30  # Filling nearest previous
    14 2020-06-01   14  2020-06-01  # First filling previous
    15 2020-06-01   15  2020-06-02  # Filling nearest next
    16 2020-06-28   16  2020-06-28  
    17 2020-06-28   17  2020-06-29
    

    I was able to get the answer, but it does not look like the most efficient approach. Can anyone suggest a better way? My current code:

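    import numpy as np
    import pandas as pd

    # Give ref_df a 'date' key so that merge_asof can match on it
    ref_df['date'] = ref_df['ref_date']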
    df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
    df = df.rename(columns={'ref_date':'nearest_ref_date'})
    nrd_cnt = df.groupby('nearest_ref_date')['date'].count().reset_index().rename(columns={'date':'nrd_count'})
    nrd_cnt['lc'] = nrd_cnt['nearest_ref_date'].shift(1)
    nrd_cnt['uc'] = nrd_cnt['nearest_ref_date'].shift(-1)
    df = df.merge(nrd_cnt, how='left', on='nearest_ref_date')
    # TODO: review. The loop is capped at 100 iterations to avoid an infinite loop in edge cases.
    for _ in range(100):
        df2 = df.copy()
        df2['days'] = np.abs((df2['nearest_ref_date'] - df2['date']).dt.days)
        df2['repeat_rank'] = df2.groupby('nearest_ref_date')['days'].rank(method='first')
        reduced_ref_df = ref_df[~ref_df['ref_date'].isin(df2['nearest_ref_date'].unique())]
        df2 = pd.merge_asof(df2, reduced_ref_df, on='date', direction='nearest')
        df2 = df2.rename(columns={'ref_date':'new_nrd'})
        df2.loc[(df2['new_nrd']<=df2['lc']) | (df2['new_nrd']>=df2['uc']), 'new_nrd'] = pd.to_datetime(np.nan)
        df2.loc[(~pd.isna(df2['new_nrd'])) & (df2['repeat_rank'] > 1), 'nearest_ref_date'] = df2['new_nrd']
        df2 = df2[['date', 'qty', 'nearest_ref_date', 'lc', 'uc']]
        if df.equals(df2):
            break
        df = df2
    df = df[['date', 'qty', 'nearest_ref_date']]
    df.loc[:, 'repeat_rank'] = df.groupby('nearest_ref_date')['nearest_ref_date'].rank(method='first')
    df = pd.merge_asof(df, ref_df, on='date', direction='nearest')
    # Repeated nearest_ref_date set to nearest ref_date
    df.loc[df['repeat_rank'] > 1, 'nearest_ref_date'] = df['ref_date']
    # Sorting nearest_ref_date within the ref_date group (without changing order of rest of cols).
    df.loc[:, 'nearest_ref_date'] = df[['ref_date', 'nearest_ref_date']].sort_values(['ref_date', 'nearest_ref_date']).reset_index().drop('index',axis=1)['nearest_ref_date']
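    # Rename so that the column header matches the output shown below
    df = df[['date', 'qty', 'nearest_ref_date']].rename(columns={'nearest_ref_date': 'ref_date'})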
    df
             date  qty    ref_date
    0  2020-01-20    0  2019-12-08
    1  2020-01-20    1  2020-01-20
    2  2020-01-20    2  2020-01-20
    3  2020-02-25    3  2020-02-25
    4  2020-03-18    4  2020-03-18
    5  2020-03-18    5  2020-03-18
    6  2020-04-09    6  2020-04-09
    7  2020-04-12    7  2020-04-10
    8  2020-04-12    8  2020-04-12
    9  2020-04-12    9  2020-04-12
    10 2020-04-12   10  2020-04-13
    11 2020-04-12   11  2020-04-14
    12 2020-05-28   12  2020-05-28
    13 2020-06-01   13  2020-05-30
    14 2020-06-01   14  2020-06-01
    15 2020-06-01   15  2020-06-02
    16 2020-06-28   16  2020-06-28  
    17 2020-06-28   17  2020-06-29
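
For comparison, the three rules above can also be written as a single pass over the distinct dates, without the repeated merge_asof loop. The following is only a minimal sketch under assumptions that the question does not state explicitly: the helper name assign_ref_dates is hypothetical, dates are plain datetime.date values, "nearest" is measured in days, a tie goes to the earlier ref_date, and a free ref_date lying between two repeated dates goes to the earlier of the two groups. It is written for readability rather than speed.

```python
from collections import Counter
from datetime import date


def assign_ref_dates(dates, ref_dates):
    """Return one ref_date per entry of `dates` (hypothetical helper).

    Assumes `dates` is sorted ascending and every date occurs in `ref_dates`.
    """
    counts = Counter(dates)
    distinct = sorted(counts)
    refs = sorted(ref_dates)
    taken = set()  # free ref_dates already handed to an earlier group
    result = []
    for i, d in enumerate(distinct):
        # Neighbouring distinct dates bound the slots this group may use.
        lo = distinct[i - 1] if i > 0 else date.min
        hi = distinct[i + 1] if i + 1 < len(distinct) else date.max
        free = [r for r in refs if lo < r < hi and r != d and r not in taken]
        # Nearest in days first; on a tie the earlier ref_date wins (assumption).
        free.sort(key=lambda r: (abs((r - d).days), r))
        picked = free[:counts[d] - 1]  # one row always keeps its own date
        taken.update(picked)
        # Rows that could not be moved to a free slot repeat the date itself.
        result.extend(sorted(picked + [d] * (counts[d] - len(picked))))
    return result


# Data from the example in the question.
dates = ([date(2020, 1, 20)] * 3 + [date(2020, 2, 25)] + [date(2020, 3, 18)] * 2
         + [date(2020, 4, 9)] + [date(2020, 4, 12)] * 5 + [date(2020, 5, 28)]
         + [date(2020, 6, 1)] * 3 + [date(2020, 6, 28)] * 2)
ref_dates = [date(2019, 12, 8), date(2020, 1, 20), date(2020, 2, 25),
             date(2020, 3, 18), date(2020, 4, 9), date(2020, 4, 10),
             date(2020, 4, 12), date(2020, 4, 13), date(2020, 4, 14),
             date(2020, 5, 28), date(2020, 5, 29), date(2020, 5, 30),
             date(2020, 6, 1), date(2020, 6, 2), date(2020, 6, 3),
             date(2020, 6, 28), date(2020, 6, 29), date(2020, 7, 7)]

result = assign_ref_dates(dates, ref_dates)
for d, r in zip(dates, result):
    print(d, r)
```

On the example data this reproduces the ref_date column of the expected output. With a DataFrame, the returned list could be assigned as a new column once the rows are sorted by date and the timestamps are converted to plain dates. Inputs where two neighbouring groups compete for the same free slots may need a different tie-breaking rule than the one assumed here.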
    