Python: create new columns if an event occurs within a time window (future or past)
I have a pandas DataFrame with the following structure:
ID date e_1
1 2016-02-01 False
2016-02-02 False
2016-02-03 True
2016-02-04 False
2016-02-05 False
2016-02-06 False
2016-02-07 False
2016-02-08 False
2016-02-09 False
2016-02-10 False
2 2016-02-01 False
2016-02-02 True
2016-02-03 True
2016-02-04 False
... ...
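I want to add several columns that encode the following: whether the event occurs within the following 1d, 2d, 3d, 4d, 5d, 1 month, and so on.
I want to specify the time deltas in a list. The column names would be e1_XX, where XX is the delta (i.e. 1d, and so on).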
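I tried shift, but that only moves the values. I also tried rolling, which seems better suited to this task, but I don't know how to pass the condition (I think with np.any), so I'm stuck.

You can use groupby and rolling: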
df.groupby('ID').e_1.apply(lambda x : x.iloc[::-1].rolling(window=3,min_periods=1).apply(any).iloc[::-1].astype(bool))
Out[51]:
ID date
1 2016-02-01 True
2016-02-02 True
2016-02-03 True
2016-02-04 False
2016-02-05 False
2016-02-06 False
2016-02-07 False
2016-02-08 False
2016-02-09 False
2016-02-10 False
2 2016-02-01 True
2016-02-02 True
2016-02-03 True
2016-02-04 False
Name: e_1, dtype: bool
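
The question also asks for several columns, one per time delta. A minimal sketch of how the reversed rolling above could be looped over a list of windows follows. It assumes one row per day per ID with no gaps, so that a window of n rows equals n days. The helper name forward_any, the windows mapping, and the toy data are illustrative and not part of the original answer; rolling(...).max() on integers is used here in place of apply(any).

```python
import pandas as pd

# Toy data in the question's layout (values are illustrative).
df = pd.DataFrame({
    "ID": [1] * 5 + [2] * 3,
    "date": pd.to_datetime([
        "2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04", "2016-02-05",
        "2016-02-01", "2016-02-02", "2016-02-03",
    ]),
    "e_1": [False, False, True, False, False, False, True, False],
}).sort_values(["ID", "date"]).reset_index(drop=True)


def forward_any(s, n_rows):
    """True where the current row or any of the next n_rows - 1 rows is True."""
    reversed_s = s.iloc[::-1].astype(int)
    rolled = reversed_s.rolling(window=n_rows, min_periods=1).max()
    # Flip back to the original order before returning.
    return rolled.iloc[::-1].astype(bool)


# Hypothetical mapping: column suffix -> window size in rows (today included).
windows = {"1d": 2, "2d": 3}
for label, n_rows in windows.items():
    df["e1_" + label] = df.groupby("ID")["e_1"].transform(
        lambda s: forward_any(s, n_rows)
    )

print(df)
```

Because the window counts rows, this only matches calendar days when every ID has exactly one row per day.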
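Edit: groupby groups on the ID index level, so for each ID we get a Series of e_1 and apply rolling to it. rolling also accepts an offset, which means that when the index is a datetime you can use an offset (3d means 3 days) to set the window size: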
df.groupby('ID').e_1.apply(lambda x : x.reset_index(level=0,drop=True).rolling('3d').apply(any))
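
To make the behaviour of an offset window concrete, here is a small sketch on a single Series with a gap in its dates. The data is made up for illustration. Note that an offset window looks backward: with the default settings, the window at time t covers the interval (t - 3 days, t].

```python
import pandas as pd

# Illustrative event series with a gap between 2016-02-03 and 2016-02-06.
s = pd.Series(
    [1, 0, 0, 0, 1],
    index=pd.to_datetime(
        ["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-06", "2016-02-07"]
    ),
)

# Did an event occur within the last 3 calendar days, today included?
past_3d = s.rolling("3D").max().astype(bool)
print(past_3d)
```

The row for 2016-02-06 comes out False even though it is only three rows after the event on 2016-02-01, because the window is measured in days rather than rows.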
Update: we need to create an extra helper column. The logic is equivalent to [::-1], but when you use a time index, the index must be monotonic.
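
One way to read the helper-column idea is as a mirroring of the dates: reflecting every date around a pivot turns "later" into "earlier", so an ordinary backward offset window on the mirrored index corresponds to a forward window on the real dates. The sketch below works on a single Series and assumes a sorted DatetimeIndex with unique dates; the function name forward_any_by_offset and the sample data are hypothetical. For the full frame it would have to be applied per ID.

```python
import pandas as pd


def forward_any_by_offset(s, window):
    """For each date t, report whether any event falls in [t, t + window).

    Assumes s has a sorted DatetimeIndex with unique dates.
    """
    pivot = s.index.max()
    # Mirror the dates around the pivot, then reverse so they ascend again.
    mirrored_index = (pivot + (pivot - s.index))[::-1]
    mirrored = pd.Series(s.to_numpy()[::-1].astype(int), index=mirrored_index)
    # A backward window on mirrored time is a forward window on real time.
    rolled = mirrored.rolling(window).max().to_numpy()[::-1]
    return pd.Series(rolled.astype(bool), index=s.index)


events = pd.Series(
    [False, False, False, True, False],
    index=pd.to_datetime(
        ["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-06", "2016-02-07"]
    ),
)
next_3d = forward_any_by_offset(events, "3D")
next_4d = forward_any_by_offset(events, "4D")
print(next_4d)
```

With a 3-day window the event on 2016-02-06 is not yet visible from 2016-02-03; with a 4-day window it is.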
Check the following code and see whether it works:
import pandas as pd

# make sure date is in a valid pandas datetime format
mydf['date'] = pd.to_datetime(mydf['date'], format='%Y-%m-%d')

# use date as the index to make date manipulation easier
mydf.set_index('date', inplace=True)

def flag_visits(grps, d, d_name):
    """Loop through each group and extend the index to 'd' more days past
    df_grp.index.max(), filling the added rows with False.
    This is needed to retrieve the forward rolling stats when running shift(1-d).
    """
    for id, df_grp in grps:
        # create the new index to cover all days required in the calculation
        idx = pd.date_range(
            start=df_grp.index.min(),
            end=df_grp.index.max() + pd.DateOffset(days=d),
            freq='D',
        )
        # set up the new column 'e1_' + d_name for the current group
        mydf.loc[mydf.ID == id, 'e1_'+d_name] = (
            df_grp.reindex(idx, fill_value=False)
                  .e_1.rolling(str(d)+'d', min_periods=0)
                  .sum().gt(0)
                  .shift(1-d)
        )

# If you know the dates are continuous without gaps, you can also reverse the
# dates, do a regular backward rolling(), and then flip the result back. In that
# case, however, rolling() can only use a number of records, not a number of days.
def flag_visits_1(grps, d, d_name):
    for id, df_grp in grps:
        mydf.loc[mydf.ID == id, 'e1_'+d_name] = (
            df_grp.sort_index(ascending=False)
                  .e_1.rolling(d, min_periods=0)
                  .sum().gt(0).sort_index()
        )

# d is the actual number of days used in Series.rolling(); d_name is used in the column name
for d, d_name in [(2, '1d'), (3, '2d'), (7, '6d'), (30, '1m')]:
    mydf.groupby('ID').pipe(flag_visits, d, d_name)

# drop date from the index
mydf.reset_index(inplace=True)
print(mydf)
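Notes:
- If "next 1 day" should not include today (so that d_name='1d' corresponds to d == 1), you can change shift(1-d) to shift(-d).
- The date field must be unique within each ID; otherwise reindex() will fail once date has been set as the index with set_index().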
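
The difference between shift(1-d) and shift(-d) in the first note can be seen on a small gap-free daily Series. This is an illustrative sketch with made-up data: it uses a row-count window, which is valid only because the dates are continuous, and fill_value=False, which assumes no events occur after the last observed date.

```python
import pandas as pd

idx = pd.date_range("2016-02-01", periods=6, freq="D")
e = pd.Series([False, False, True, False, False, False], index=idx)

d = 2  # window length in days (equal to rows here, since there are no gaps)
rolled = e.astype(int).rolling(d, min_periods=1).sum().gt(0)

# Window covers today and the next d - 1 days.
incl_today = rolled.shift(1 - d, fill_value=False)
# Window covers the next d days, today excluded.
excl_today = rolled.shift(-d, fill_value=False)

print(pd.DataFrame({"e_1": e, "incl_today": incl_today, "excl_today": excl_today}))
```

On 2016-02-01 the event two days later is outside the window that includes today but inside the one that starts tomorrow.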
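Thanks! Could you also address the part of the question about adding several columns? And could you explain your approach?

Your first approach works, but the second one is not correct. I get: ID date 2016-02-01 1.0 2016-02-02 1.0 2016-02-03 1.0 2016-02-04 0.0 2016-02-05 0.0 2016-02-06 0.0 2016-02-07 0.0 2016-02-08 0.0 2016-02-09 0.0 2016-02-10 0.0 2 2016-02-01 1.0 2016-02-02 1.0 2016-02-02 1.0 2016-02-03 1.0 2016-02-04 0.0 (that is, it is somewhat backwards).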
df['New']=pd.to_datetime('today')+(pd.to_datetime('today')-df.index.get_level_values(1))
df=df.sort_index(level=0).sort_values('New')
df['New']=df.groupby('ID',sort=False).apply(lambda x : x.reset_index(drop=True).set_index('New')['e_1'].rolling('3d',min_periods=1).apply(any)).sort_index(level=1).values.astype(bool)
df.sort_index()
Out[278]:
e_1 New
ID date
1 2016-02-01 False True
2016-02-02 False True
2016-02-03 True True
2016-02-04 False False
2016-02-05 False False
2016-02-06 False False
2016-02-07 False False
2016-02-08 False False
2016-02-09 False False
2016-02-10 False False
2 2016-02-01 False True
2016-02-02 True True
2016-02-03 True True
2016-02-04 False False