Python pandas: given a start and end date, add a column for each day in between, then fill in values?
Here is my data:
import pandas as pd

df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': '', 'spend': 10000, 'campaign_id': 3},
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
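For each row, I need to add one column per day since 2019/12/01 holding that campaign's spend for that day, which I get by dividing the campaign's spend by the total number of days it ran.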
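So here I would add a column for each day between Dec 1 and today, Dec 10. For row 1, the five columns for Dec 1 to Dec 5 would contain 2000, and the five columns for Dec 6 to Dec 10 would contain zero.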
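I know pandas is well designed for this kind of problem, but I don't know where to start.
This doesn't look like a straightforward task to me. First, convert the date columns if you haven't already: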
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
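As a small illustration of what this conversion does with the empty `end_date` in the question, and of the inclusive day count that the per-day spend relies on, here is a minimal sketch. The frame `dates` and the column `n_days` are names made up for this example; they are not part of the answer's code.

```python
import pandas as pd

# Hypothetical two-row frame: one closed campaign, one with an empty end date.
dates = pd.DataFrame({
    "start_date": ["2019/12/01", "2019/12/01"],
    "end_date": ["2019/12/05", ""],
})
dates["start_date"] = pd.to_datetime(dates["start_date"])
dates["end_date"] = pd.to_datetime(dates["end_date"])  # "" becomes NaT

# Inclusive day count: Dec 1 to Dec 5 is 5 days, hence the + 1.
# A NaT end date propagates, so n_days is missing for the open-ended row.
dates["n_days"] = (dates["end_date"] - dates["start_date"]).dt.days + 1
```

The missing end date therefore has to be handled explicitly before any division by the number of days.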
Then create a helper function for resampling:
def resampler(data, daterange):
    temp = (data.set_index('start_date').groupby('campaign_id')
                .apply(daterange)
                .drop("campaign_id", axis=1)
                .reset_index().rename(columns={"level_1": "start_date"}))
    return temp
Now it is a three-step process. First, resample the data according to each group's end date:
df1 = resampler(
    df,
    lambda d: d.reindex(pd.date_range(min(d.index), max(d["end_date"]), freq="D"))
    if d["end_date"].notnull().all() else d,
)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
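The `transform` line works because reindexing leaves only the original row of each campaign with a spend value and fills the added days with NaN. A minimal sketch of that arithmetic for campaign 1 follows; the Series `spend` is built by hand here, not taken from `df1`.

```python
import numpy as np
import pandas as pd

# One campaign after reindexing to 5 daily rows: only the first row has a spend.
spend = pd.Series([10000, np.nan, np.nan, np.nan, np.nan])

total = spend.mean()          # NaN values are skipped, so this is 10000.0
per_day = total / len(spend)  # len counts all 5 rows, giving 2000.0
```

A campaign without an end date is not reindexed in the previous step, so its group has a single row and keeps its full spend on the start date.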
After computing the per-day average, resample up to the current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally, transpose the dataframe into wide form, with one column per date:
df1 = pd.concat([df1.drop("end_date", axis=1).set_index(["campaign_id", "start_date"]).unstack(),
                 df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates, "end_date"]
print(df1)
             2019-12-01 00:00:00  2019-12-02 00:00:00  2019-12-03 00:00:00  2019-12-04 00:00:00  2019-12-05 00:00:00  2019-12-06 00:00:00  2019-12-07 00:00:00  2019-12-08 00:00:00  2019-12-09 00:00:00  2019-12-10 00:00:00    end_date
campaign_id
1                         2000.0               2000.0               2000.0               2000.0               2000.0                  NaN                  NaN                  NaN                  NaN                  NaN  2019-12-05
2                            NaN                  NaN                  NaN                  NaN              10000.0              10000.0              10000.0              10000.0              10000.0                  NaN  2019-12-09
3                        10000.0                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN         NaT
4                           50.0                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN  2019-12-01
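The question asks for zero on days outside a campaign, while the result above has NaN there. One way to get zeros, shown as a sketch on a small hand-built frame (`wide` and `date_cols` are illustrative names, and the string column labels stand in for the timestamp columns above), is to fill only the date columns and leave `end_date` alone:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide result: two date columns plus end_date.
wide = pd.DataFrame(
    {
        "2019-12-01": [2000.0, np.nan],
        "2019-12-02": [2000.0, np.nan],
        "end_date": [pd.Timestamp("2019-12-02"), pd.NaT],
    },
    index=pd.Index([1, 2], name="campaign_id"),
)

# Fill NaN with 0 in the per-day columns only, so a missing end_date stays NaT.
date_cols = wide.columns.drop("end_date")
wide[date_cols] = wide[date_cols].fillna(0)
```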
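How do you handle the missing end date in the third row? Can you give an expected output?
@AdibP Sorry, I should have specified: it should be set to today's date.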
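With that clarification, a different and more direct route than the resampling answer is possible. The following is only a sketch under stated assumptions: `daily_spend` and its `today` parameter are hypothetical names, `today` is fixed to 2019-12-10 to match the question (in practice something like `pd.Timestamp.today().normalize()` could be passed), and `campaign_id` is assumed to be unique per row.

```python
import pandas as pd

def daily_spend(df, today):
    """Spread each campaign's spend evenly over its days, one column per day."""
    start = pd.to_datetime(df["start_date"])
    # Missing end dates ("" -> NaT) are treated as running until `today`.
    end = pd.to_datetime(df["end_date"]).fillna(today)
    n_days = (end - start).dt.days + 1  # inclusive day count
    per_day = df["spend"] / n_days

    # Start from all zeros, then write the per-day value into each campaign's range.
    days = pd.date_range(start.min(), today, freq="D")
    out = pd.DataFrame(0.0, index=df["campaign_id"], columns=days)
    for cid, s, e, v in zip(df["campaign_id"], start, end, per_day):
        out.loc[cid, s:e] = v  # label slicing on dates includes both endpoints
    return out

df = pd.DataFrame([
    {"start_date": "2019/12/01", "end_date": "2019/12/05", "spend": 10000, "campaign_id": 1},
    {"start_date": "2019/12/05", "end_date": "2019/12/09", "spend": 50000, "campaign_id": 2},
    {"start_date": "2019/12/01", "end_date": "", "spend": 10000, "campaign_id": 3},
    {"start_date": "2019/12/01", "end_date": "2019/12/01", "spend": 50, "campaign_id": 4},
])
result = daily_spend(df, today=pd.Timestamp("2019-12-10"))
```

Under these assumptions campaign 3 is spread over all ten days, and days outside a campaign hold zero rather than NaN. The explicit loop keeps the sketch readable; for many campaigns a vectorized approach would likely be preferable.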