Python: filling in missing dates and buckets at the same time

python pandas jupyter-notebook

I have a DataFrame like this:

Start_Month  Bucket  Count  Complete  Partial
10/01/2015   0       57     91        0.66
11/01/2015   0       678    8         0.99
02/01/2016   0       68     12        0.12
10/01/2015   1       78     79        0.22
11/01/2015   1       99     56        0.67
1/01/2016    1       789    67        0.78
10/01/2015   3       678    178       0.780
11/01/2015   3       2880   578       0.678

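Basically, I need to fill in every start month (12/01/2015, 01/01/2016, ... are missing) and every missing bucket, such as 2. For the missing buckets and start months, the remaining columns (Count, Complete, Partial) should be zero. I thought relativedelta(months=+1) would help, but I am not sure how to use it.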
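For reference, the relativedelta(months=+1) idea mentioned here can be sketched as follows. This is only a minimal illustration: month_starts is a hypothetical helper name, and the start and end dates are taken from the sample data. pd.date_range with freq='MS' produces the same sequence of month starts.

```python
from datetime import date
from dateutil.relativedelta import relativedelta


def month_starts(first, last):
    """Yield the first day of every month from `first` to `last`, inclusive.

    Hypothetical helper for illustration; assumes `first` is itself
    the first day of a month.
    """
    current = first
    while current <= last:
        yield current
        # Step forward by exactly one calendar month
        current += relativedelta(months=+1)


months = list(month_starts(date(2015, 10, 1), date(2016, 2, 1)))
print(months)
```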

import pandas as pd

data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])

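Basically, I want the start months and buckets to repeat as a group, with the other values set to 0. That is, every month from 10/01/2015 to 02/01/2016 should be present (12/01/2015 and 01/01/2016 are missing), and every bucket from 0 to 3 should be present (2 is missing).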

I tried this, and it partially does what I want:

df['Start_Month'] = pd.to_datetime(df['Start_Month'])
# Select several columns with a list (double brackets); the bare-tuple form is rejected by recent pandas
s = df.groupby(['Bucket', pd.Grouper(key='Start_Month', freq='MS')])[['Count', 'Complete', 'Partial']].sum()
df1 = (s.reset_index(level=0)
    .groupby('Bucket')[['Count', 'Complete', 'Partial']]
    .apply(lambda x: x.asfreq('MS'))
    .reset_index())

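It adds some of the missing months, but it does not repeat them for every bucket, and it does not add the missing bucket integers in between.

Here is a frankenhack of a solution until someone posts how it should be done: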

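Start with df. Create a separate DataFrame for bucket 0 that contains the complete set of dates. Filter the original df by the corresponding dates and bucket, and merge the result into df0. Repeat this for buckets 1, 2 and 3 to create df1, df2 and df3 (not shown because it is repetitive; you can of course do this in a loop). Then concatenate all four DataFrames and fill the NaN values with zeros.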
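The code for the first two steps is not shown in the text above. A minimal sketch of what they might look like for bucket 0 follows; it is a reconstruction from the description, not the answerer's exact code, and df_filt is an assumed intermediate name.

```python
import pandas as pd

# Sample data from the question
data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Step 1: a frame for bucket 0 that holds every month start in the range
df0 = pd.DataFrame({
    'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
    'Bucket': 0,
})

# Step 2: take the original rows for bucket 0 and merge them onto the
# complete range; months with no data end up as NaN
df_filt = df[df['Bucket'] == 0].drop(columns='Bucket')
df0 = df0.merge(df_filt, on='Start_Month', how='left')
print(df0)
```

The NaN values left by the merge are what the final fillna(0) replaces with zeros.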

# Concat
df_final = pd.concat([df0, df1, df2, df3], axis=0).fillna(0)

    Start_Month Bucket       Count   Complete   Partial
0   2015-10-01       0       57.0        91.0   0.660
1   2015-11-01       0       678.0       8.0    0.990
2   2015-12-01       0       0.0         0.0    0.000
3   2016-01-01       0       0.0         0.0    0.000
4   2016-02-01       0       68.0        12.0   0.120
0   2015-10-01       1       78.0        79.0   0.220
1   2015-11-01       1       99.0        56.0   0.670
2   2015-12-01       1       0.0         0.0    0.000
3   2016-01-01       1       789.0       67.0   0.780
4   2016-02-01       1       0.0         0.0    0.000
0   2015-10-01       2       0.0         0.0    0.000
1   2015-11-01       2       0.0         0.0    0.000
2   2015-12-01       2       0.0         0.0    0.000
3   2016-01-01       2       0.0         0.0    0.000
4   2016-02-01       2       0.0         0.0    0.000
0   2015-10-01       3       678.0       178.0  0.780
1   2015-11-01       3       2880.0      578.0  0.678
2   2015-12-01       3       0.0         0.0    0.000
3   2016-01-01       3       0.0         0.0    0.000
4   2016-02-01       3       0.0         0.0    0.000

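Update: showing the code for the full loop and answering your question from the comments. As for that question, you can get an empty DataFrame containing the repeating sequences of dates and buckets like this: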

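# Each bucket repeated once per month: 0, 0, 0, 0, 0, 1, 1, ...
bucket_list = [ele for ele in [0, 1, 2, 3] for i in range(5)]
# The five month starts repeated once per bucket
dates = list(pd.date_range('2015-10-01', '2016-02-01', freq='MS')) * 4
# Building from a dict keeps Start_Month as datetime and Bucket as integer
df = pd.DataFrame({'Start_Month': dates, 'Bucket': bucket_list})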

Output:
    Start_Month Bucket
0   2015-10-01  0
1   2015-11-01  0
2   2015-12-01  0
3   2016-01-01  0
4   2016-02-01  0
5   2015-10-01  1
6   2015-11-01  1
7   2015-12-01  1
8   2016-01-01  1
9   2016-02-01  1
10  2015-10-01  2
11  2015-11-01  2
12  2015-12-01  2
13  2016-01-01  2
14  2016-02-01  2
15  2015-10-01  3
16  2015-11-01  3
17  2015-12-01  3
18  2016-01-01  3
19  2016-02-01  3

I wrote something similar, just generalized a bit:

import pandas as pd
import numpy as np

# Convert the date strings to datetimes
df['Start_Month'] = pd.to_datetime(df['Start_Month'])

# Find the date range, stepping one month start at a time
rng = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')

# Create the date DataFrame
df1 = pd.DataFrame({'Start_Month': rng})

# Convert the bucket field to integer
df['Bucket'] = df['Bucket'].astype(int)

# Every bucket value from the minimum to the maximum
Bucket = np.arange(df['Bucket'].min(), df['Bucket'].max() + 1, 1)

# Repeat the date range once for every bucket
df1 = pd.concat([df1] * len(Bucket), ignore_index=True)

# Assign each bucket value to one full block of dates
df1['Bucket'] = np.repeat(Bucket, len(rng))

# Left-merge the original data onto the full grid and fill the gaps with 0
merged_left = pd.merge(left=df1, right=df, how='left', on=['Start_Month', 'Bucket']).fillna(0)
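The same grid-then-fill idea can also be written with a MultiIndex. The following is a sketch of that alternative rather than part of either answer. It assumes each (Bucket, Start_Month) pair appears at most once in df; if pairs can repeat, aggregate them first (for example with groupby(...).sum()). The names months, buckets, full_index and filled are illustrative.

```python
import pandas as pd

# Sample data from the question
data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Every month start between the first and last date in the data
months = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')
# Every integer bucket between the smallest and largest bucket
buckets = range(df['Bucket'].min(), df['Bucket'].max() + 1)

# All (Bucket, Start_Month) combinations
full_index = pd.MultiIndex.from_product([buckets, months],
                                        names=['Bucket', 'Start_Month'])

# Reindex onto the full grid; combinations absent from df are filled with 0
filled = (df.set_index(['Bucket', 'Start_Month'])
            .reindex(full_index, fill_value=0)
            .reset_index())
print(filled)
```

Because fill_value=0 is applied during the reindex, the integer columns do not pass through NaN, so Count and Complete should keep their integer dtype.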

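Is it possible to create a dummy table that holds the full range of dates and buckets, from the first value to the last? I think that would make this easier and more automated.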

The full loop version of the per-bucket approach from the update above:

# Assumes df['Start_Month'] has already been converted with pd.to_datetime
def get_separate_df(df, bucket_num):
    # Complete range of month starts for this bucket
    df_bucket = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                              'Bucket': bucket_num})
    # Original rows that fall in the range and belong to this bucket
    df_filt = df[(df['Start_Month'].isin(df_bucket['Start_Month'])) &
                 (df['Bucket'] == bucket_num)]
    # Months without data get NaN in the value columns
    df_bucket = pd.merge(df_bucket, df_filt, on='Start_Month', how='outer')
    df_bucket = df_bucket.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})

    return df_bucket

dfs = [get_separate_df(df, i) for i in range(4)]

# Concat
df_final = pd.concat(dfs, axis=0).fillna(0)
