Python: fill in missing dates and buckets at the same time
I have a DataFrame like this:
Start_MONTH Bucket Count Complete Partial
10/01/2015 0 57 91 0.66
11/01/2015 0 678 8 0.99
02/01/2016 0 68 12 0.12
10/01/2015 1 78 79 0.22
11/01/2015 1 99 56 0.67
1/01/2016 1 789 67 0.78
10/01/2015 3 678 178 0.780
11/01/2015 3 2880 578 0.678
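Basically, I need to fill in every Start_Month (12/01/2015, 01/01/2016, ... are missing) and every missing bucket, such as 2. For the missing buckets and start months, the remaining columns (Count, Complete, Partial) should be zero. I thought relativedelta(months=+1) would help, but I am not sure how to use it.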
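For reference, here is a minimal sketch of what relativedelta(months=+1) does when used to walk from the first month to the last. The helper name month_starts is hypothetical and not part of the original post. The last line shows that pd.date_range with freq='MS' (month start) gives the same sequence without a loop.

```python
from datetime import date

import pandas as pd
from dateutil.relativedelta import relativedelta


def month_starts(first, last):
    """Return every first-of-month date from `first` to `last`, inclusive.

    Hypothetical helper, for illustration only.
    """
    months = []
    current = first
    while current <= last:
        months.append(current)
        current += relativedelta(months=+1)  # step forward one calendar month
    return months


months = month_starts(date(2015, 10, 1), date(2016, 2, 1))

# The same sequence, built by pandas; .date converts Timestamps to datetime.date
same = list(pd.date_range('2015-10-01', '2016-02-01', freq='MS').date)
```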
import pandas as pd

data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])
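Basically, I want the start months and the buckets to repeat as a group, with the other values set to 0. That is, every month from 10/01/2015 to 02/01/2016 should be present (12/01/2015 and 01/01/2016 are missing), and every bucket from 0 to 3 should be present (2 is missing).
I tried this, and it partly does what I want: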
df['Start_Month'] = pd.to_datetime(df['Start_Month'])
# Select several columns with a list (double brackets); the bare tuple form is not accepted by recent pandas
s = df.groupby(['Bucket', pd.Grouper(key='Start_Month', freq='MS')])[['Count', 'Complete', 'Partial']].sum()
df1 = (s.reset_index(level=0)
        .groupby('Bucket')[['Count', 'Complete', 'Partial']]
        .apply(lambda x: x.asfreq('MS'))
        .reset_index())
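It adds some of the missing months, but it does not repeat them for every bucket, and it does not add the missing bucket integers in between.

Below is a frankenhack of a solution, until someone posts how this should really be done:

1. Start with the df.
2. Create a separate df for bucket 0 that contains the full range of dates.
3. Filter the original df by the corresponding dates and bucket, and merge the result with df0.
4. Repeat this process for buckets 1, 2 and 3, creating df1, df2 and df3 (not shown because it is repetitive; of course, you could do this in a loop).

Then concat all 4 dfs together and fill the NAs with zeros.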
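The code for the individual buckets is not shown in the post. As a rough sketch, the two steps for bucket 0 described above (build the full date range, then merge the filtered rows onto it) might look like the following. It assumes the sample data from the question and uses a left merge on both keys, which may differ from what the answerer actually wrote. Missing months are left as NaN here so that the final fillna(0) can take care of them.

```python
import pandas as pd

# Sample data from the question
data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Full monthly date range for bucket 0 (the scalar 0 is broadcast to every row)
df0 = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                    'Bucket': 0})

# Rows of the original df that belong to bucket 0
df_filt = df[df['Bucket'] == 0]

# Left merge keeps all five months; months without data get NaN
df0 = df0.merge(df_filt, on=['Start_Month', 'Bucket'], how='left')
```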
# Concat
df_final = pd.concat([df0, df1, df2, df3], axis=0).fillna(0)
Start_Month Bucket Count Complete Partial
0 2015-10-01 0 57.0 91.0 0.660
1 2015-11-01 0 678.0 8.0 0.990
2 2015-12-01 0 0.0 0.0 0.000
3 2016-01-01 0 0.0 0.0 0.000
4 2016-02-01 0 68.0 12.0 0.120
0 2015-10-01 1 78.0 79.0 0.220
1 2015-11-01 1 99.0 56.0 0.670
2 2015-12-01 1 0.0 0.0 0.000
3 2016-01-01 1 789.0 67.0 0.780
4 2016-02-01 1 0.0 0.0 0.000
0 2015-10-01 2 0.0 0.0 0.000
1 2015-11-01 2 0.0 0.0 0.000
2 2015-12-01 2 0.0 0.0 0.000
3 2016-01-01 2 0.0 0.0 0.000
4 2016-02-01 2 0.0 0.0 0.000
0 2015-10-01 3 678.0 178.0 0.780
1 2015-11-01 3 2880.0 578.0 0.678
2 2015-12-01 3 0.0 0.0 0.000
3 2016-01-01 3 0.0 0.0 0.000
4 2016-02-01 3 0.0 0.0 0.000
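Update: showing the code for the full loop, and answering your question from the comments.
As for your question in the comments, you can get an empty DataFrame containing the repeated sequence of dates and buckets like this: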
bucket_list = [ele for ele in [0,1,2,3] for i in range(5)]
dates = list(pd.date_range('2015-10-01', '2016-02-01', freq='MS'))*4
df = pd.DataFrame(data=[dates, bucket_list]).T.rename(columns={0:'Start_Month', 1:'Bucket'})
Output:
Start_Month Bucket
0 2015-10-01 0
1 2015-11-01 0
2 2015-12-01 0
3 2016-01-01 0
4 2016-02-01 0
5 2015-10-01 1
6 2015-11-01 1
7 2015-12-01 1
8 2016-01-01 1
9 2016-02-01 1
10 2015-10-01 2
11 2015-11-01 2
12 2015-12-01 2
13 2016-01-01 2
14 2016-02-01 2
15 2015-10-01 3
16 2015-11-01 3
17 2015-12-01 3
18 2016-01-01 3
19 2016-02-01 3
I wrote something similar, just generalized a little:
import pandas as pd
import numpy as np

# convert the date strings to datetime
df['Start_Month'] = pd.to_datetime(df['Start_Month'])
# build the full date range, one entry per month start
rng = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')
# create a DataFrame of the dates
df1 = pd.DataFrame({'Start_Month': rng})
# convert the Bucket column to integer
df['Bucket'] = df['Bucket'].astype(int)
# every bucket value from the minimum to the maximum, inclusive
Bucket = np.arange(df['Bucket'].min(), df['Bucket'].max() + 1, 1)
# repeat the date range once for every bucket
df1 = pd.concat([df1] * len(Bucket))
# repeat each bucket value once for every date
df1['Bucket'] = np.repeat(Bucket, len(rng))
# left-merge onto the original DataFrame and fill the gaps with 0
merged_left = pd.merge(left=df1, right=df, how='left', on=['Start_Month', 'Bucket']).fillna(0)
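The same idea can also be written with a MultiIndex instead of concat and merge. The sketch below is an alternative, not taken from either answer: it assumes the sample data from the question and that each (Bucket, Start_Month) pair appears at most once, since reindex needs a unique index. The names months, buckets, full_index and filled are illustrative.

```python
import pandas as pd

# Sample data from the question
data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Every month start between the first and last date in the data
months = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')
# Every bucket between the smallest and largest bucket, inclusive
buckets = range(int(df['Bucket'].min()), int(df['Bucket'].max()) + 1)

# All (Bucket, Start_Month) combinations
full_index = pd.MultiIndex.from_product([buckets, months],
                                        names=['Bucket', 'Start_Month'])

# Reindex onto the full grid; combinations with no data are filled with 0
filled = (df.set_index(['Bucket', 'Start_Month'])
            .reindex(full_index, fill_value=0)
            .reset_index())
```

Note that the column order of filled starts with Bucket and then Start_Month, because that is the order of the index levels.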
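Is it possible to create a dummy table that holds the dates and buckets over the whole range, from the first value to the last? I think that would make it easier and more automated.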
Full loop code from the update:
def get_separate_df(df, bucket_num):
    # Full monthly date range for this bucket; built from a dict so Start_Month keeps its datetime dtype
    df_bucket = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                              'Bucket': bucket_num})
    # Rows of the original df that fall in the date range and belong to this bucket
    df_filt = df[(df['Start_Month'].isin(df_bucket['Start_Month'])) &
                 (df['Bucket'] == bucket_num)]
    df_bucket = pd.merge(df_bucket, df_filt, on='Start_Month', how='outer')
    # Both frames have a Bucket column; keep the one from the full range
    df_bucket = df_bucket.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})
    return df_bucket

dfs = [get_separate_df(df, i) for i in range(4)]
# Concat
df_final = pd.concat(dfs, axis=0).fillna(0)