Python: create a new column if an event occurs within a time window (future or past)


I have a pandas DataFrame with the following structure:

ID    date           e_1   
 1    2016-02-01     False 
      2016-02-02     False 
      2016-02-03     True  
      2016-02-04     False
      2016-02-05     False
      2016-02-06     False
      2016-02-07     False
      2016-02-08     False
      2016-02-09     False
      2016-02-10     False  
 2    2016-02-01     False  
      2016-02-02     True    
      2016-02-03     True    
      2016-02-04     False  
          ...         ...
I want to add several columns that encode whether the event e_1 occurs within the following 1d, 2d, 3d, 4d, 5d, 1 month, and so on.

I want to specify the time deltas in a list. The columns would be named e1_XX, where XX is the delta (i.e. 1d, and so on).

I tried shift, but that only moves the values. I also tried rolling (which seems better suited to this task):


But I don't know how to pass the condition (I think it would be np.any), and I'm stuck.

You can use groupby together with rolling:

df.groupby('ID').e_1.apply(lambda x : x.iloc[::-1].rolling(window=3,min_periods=1).apply(any).iloc[::-1].astype(bool))
Out[51]: 
ID  date      
1   2016-02-01     True
    2016-02-02     True
    2016-02-03     True
    2016-02-04    False
    2016-02-05    False
    2016-02-06    False
    2016-02-07    False
    2016-02-08    False
    2016-02-09    False
    2016-02-10    False
2   2016-02-01     True
    2016-02-02     True
    2016-02-03     True
    2016-02-04    False
Name: e_1, dtype: bool
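
To make the reversal trick easier to follow, here is a minimal sketch on a single Series rather than the grouped frame. The helper name any_in_next_rows and the sample data are illustrative assumptions, not part of the answer above. The window counts rows, so it only corresponds to days when the dates have no gaps.

```python
import pandas as pd

# Hypothetical sample: one event on 2016-02-03, daily dates without gaps.
s = pd.Series(
    [False, False, True, False, False, False],
    index=pd.date_range("2016-02-01", periods=6, freq="D"),
)

def any_in_next_rows(series, n):
    """True where the current row or any of the next n-1 rows is True."""
    # Reversing the order turns "the next n rows" into "the previous n rows",
    # which is what an ordinary rolling window looks at.
    reversed_flags = series.iloc[::-1].astype(int)
    hits = reversed_flags.rolling(window=n, min_periods=1).max()
    # Flip back to the original order and return booleans.
    return hits.iloc[::-1].astype(bool)

result = any_in_next_rows(s, 3)
print(result)
```

With n=3 the first three dates are flagged, which matches the output shown for ID 1 above.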
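Edit: groupby splits on the ID index level, so for each ID we get a Series of e_1 and apply rolling to it. rolling can also accept an offset: when the index is a datetime, an offset such as 3d (3 days) determines the window size.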

df.groupby('ID').e_1.apply(lambda x : x.reset_index(level=0,drop=True).rolling('3d').apply(any))
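
One thing worth checking with an offset window is its direction. The small sketch below (sample data assumed, not taken from the question) suggests that rolling with an offset looks backward from each timestamp, so on its own it flags the days after an event rather than the days before it.

```python
import pandas as pd

# Hypothetical sample: one event on 2016-02-03.
s = pd.Series(
    [False, False, True, False, False, False],
    index=pd.date_range("2016-02-01", periods=6, freq="D"),
)

# A "3D" window at time t covers the interval (t - 3 days, t],
# i.e. today and the two previous days on daily data.
backward = s.astype(int).rolling("3D").max().gt(0)
print(backward)
```

Here the flag is set on 2016-02-03 through 2016-02-05, so a forward-looking flag still needs some form of reversal or shifting.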
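Update: we need to create another column to help. The logic is equivalent to [::-1], but adapted to a time index, because with an offset window the index must be monotonic.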


Check the following code and see whether it works:

import pandas as pd

# mydf is the DataFrame from the question, with columns ID, date and e_1
# make sure date is in valid pandas datetime format
mydf['date'] = pd.to_datetime(mydf['date'], format='%Y-%m-%d')

# use date as index to make it easier in date manipulations
mydf.set_index('date', inplace=True)

def flag_visits(grps, d, d_name):
    """Loop through each group and extend the index to 'd' more days from
       df_grp.index.max(). fill the NaN values with *False*
       this is needed to retrieve the forward rolling stats when running shift(1-d)
    """
    for id, df_grp in grps:
        # create the new index to cover all days required in calculation
        idx = pd.date_range(
              start = df_grp.index.min()
            , end   = df_grp.index.max() + pd.DateOffset(days=d)
            , freq  = 'D'
        )

        # set up the new column 'd_name' for the current group
        mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.reindex(idx, fill_value=False)
                                                       .e_1.rolling(str(d)+'d', min_periods=0)
                                                       .sum().gt(0)
                                                       .shift(1-d)
        )

# If you know the dates are continuous, without gaps, you can instead reverse the
# dates, do a regular backward rolling(), and then flip the result back. In that
# case rolling() can only count records, not days.
def flag_visits_1(grps, d, d_name):
    for id, df_grp in grps:
        mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.sort_index(ascending=False)
                                                       .e_1.rolling(d, min_periods=0)
                                                       .sum().gt(0).sort_index()
        )



# d is the actual number of days used in Series.rolling(); d_name is used in the column name
for d, d_name in [ (2, '1d') , (3, '2d'), (7, '6d'), (30, '1m') ]:
    mydf.groupby('ID').pipe(flag_visits, d, d_name)

# drop date from the index 
mydf.reset_index(inplace=True)

print(mydf)
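Notes:

  • If "next 1 day" should not include today, so that d_name='1d' corresponds to d == 1, adjust shift(1-d) to shift(-d).
  • The date field must be unique within each ID, otherwise you will not be able to use set_index().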
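
The difference between shift(1-d) and shift(-d) can be checked on a single Series. This is only a sketch of the pad, roll and shift steps described above; the helper name flag_next, its include_today parameter and the sample data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: one event on 2016-02-03.
s = pd.Series(
    [False, False, True, False, False, False],
    index=pd.date_range("2016-02-01", periods=6, freq="D"),
)

def flag_next(series, d, include_today=True):
    """Flag rows whose d-day forward window contains a True value."""
    # Pad the index with d extra days so the shifted values exist at the end.
    idx = pd.date_range(
        series.index.min(), series.index.max() + pd.Timedelta(days=d), freq="D"
    )
    rolled = (
        series.reindex(idx, fill_value=False)
        .astype(int)
        .rolling(f"{d}D", min_periods=1)  # backward window of d days
        .sum()
        .gt(0)
    )
    # shift(1-d): window is today plus the next d-1 days.
    # shift(-d):  window is the next d days, today excluded.
    shift_by = 1 - d if include_today else -d
    shifted = rolled.shift(shift_by, fill_value=False)
    return shifted.reindex(series.index)

with_today = flag_next(s, 2, include_today=True)
without_today = flag_next(s, 1, include_today=False)
print(with_today)
print(without_today)
```

With d=2 and today included, 2016-02-02 and 2016-02-03 are flagged. With d=1 and today excluded, only 2016-02-02 is flagged, because the event happens on the following day.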

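Thanks! Could you address the part of the question about adding several columns? Also, could you explain your approach?

Your first approach works, but the second one is not correct. For my data it gives: ID date 2016-02-01 1.0 2016-02-02 1.0 2016-02-03 1.0 2016-02-04 0.0 2016-02-05 0.0 2016-02-06 0.0 2016-02-07 0.0 2016-02-08 0.0 2016-02-09 0.0 2016-02-10 0.0 2 2016-02-01 1.0 2016-02-02 1.0 2016-02-02 1.0 2016-02-03 1.0 2016-02-04 0.0
(that is, it looks somewhat backward)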
df['New']=pd.to_datetime('today')+(pd.to_datetime('today')-df.index.get_level_values(1))
df=df.sort_index(level=0).sort_values('New')
df['New']=df.groupby('ID',sort=False).apply(lambda x : x.reset_index(drop=True).set_index('New')['e_1'].rolling('3d',min_periods=1).apply(any)).sort_index(level=1).values.astype(bool)
df.sort_index()
Out[278]: 
                 e_1    New
ID date                    
1  2016-02-01  False   True
   2016-02-02  False   True
   2016-02-03   True   True
   2016-02-04  False  False
   2016-02-05  False  False
   2016-02-06  False  False
   2016-02-07  False  False
   2016-02-08  False  False
   2016-02-09  False  False
   2016-02-10  False  False
2  2016-02-01  False   True
   2016-02-02   True   True
   2016-02-03   True   True
   2016-02-04  False  False
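
For the part of the question about producing several columns from a list of deltas, one possible sketch is to wrap the mirrored-time idea in a loop. Everything named here (add_forward_flags, the horizons mapping, the flat ID/date/e_1 column layout) is a hypothetical illustration rather than code from the thread. It assumes that today counts as part of each window and approximates a month as 30 days, since an offset window needs a fixed length.

```python
import pandas as pd

def add_forward_flags(df, horizons, id_col="ID", date_col="date", event_col="e_1"):
    """Add one boolean column e1_<label> per entry of `horizons`.

    `horizons` maps a label such as "2d" to a number of days ahead.
    """
    out = df.sort_values([id_col, date_col]).reset_index(drop=True)
    ref = out[date_col].max()
    for label, days in horizons.items():
        pieces = []
        for _, grp in out.groupby(id_col):
            # Reverse the group and mirror its dates around `ref`: the future
            # becomes the past, and the mirrored index is increasing again.
            rev = grp.iloc[::-1]
            mirrored = pd.Series(
                rev[event_col].astype(int).to_numpy(),
                index=pd.DatetimeIndex(ref + (ref - rev[date_col])),
            )
            # A backward window of days+1 days in mirrored time covers
            # today and the next `days` days in real time.
            hit = mirrored.rolling(f"{days + 1}D", min_periods=1).max().to_numpy() > 0
            pieces.append(pd.Series(hit, index=rev.index))
        out["e1_" + label] = pd.concat(pieces).sort_index()
    return out

# Sample data shaped like the frame in the question (flat columns assumed).
df = pd.DataFrame({
    "ID": [1] * 10 + [2] * 4,
    "date": list(pd.date_range("2016-02-01", periods=10))
            + list(pd.date_range("2016-02-01", periods=4)),
    "e_1": [False, False, True] + [False] * 7 + [False, True, True, False],
})
flagged = add_forward_flags(df, {"1d": 1, "2d": 2, "1m": 30})
print(flagged)
```

Because the window is defined by an offset on the mirrored dates rather than by a row count, this sketch should also tolerate gaps in the dates, although it has only been reasoned through on gap-free data like the sample above.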