Python: filling in a DataFrame with missing values (python, pandas)

I have this DataFrame as an example:

import pandas as pd

# create dataframe
df = pd.DataFrame([['DE', 'Table', 201705, 201705, 1000], ['DE', 'Table', 201705, 201704, 1000],
                   ['DE', 'Table', 201705, 201702, 1000], ['DE', 'Table', 201705, 201701, 1000],
                   ['AT', 'Table', 201708, 201708, 1000], ['AT', 'Table', 201708, 201706, 1000],
                   ['AT', 'Table', 201708, 201705, 1000], ['AT', 'Table', 201708, 201704, 1000]],
                  columns=['ISO', 'Product', 'Billed Week', 'Created Week', 'Billings'])
print(df)

  ISO Product  Billed Week  Created Week  Billings
0  DE   Table       201705        201705      1000
1  DE   Table       201705        201704      1000
2  DE   Table       201705        201702      1000
3  DE   Table       201705        201701      1000
4  AT   Table       201708        201708      1000
5  AT   Table       201708        201706      1000
6  AT   Table       201708        201705      1000
7  AT   Table       201708        201704      1000

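What I need to do is fill in some missing data for each groupby ['ISO', 'Product'] group wherever there is a break in the sequence, i.e. no bills were created in a given week, so that row is missing. The fill needs to be based on the maximum of Billed Week and the minimum of Created Week. In other words, the combinations should be complete, with no breaks in the sequence.

So for the data above, the missing records that I need to append programmatically look like this: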

  ISO Product  Billed Week  Created Week  Billings
0  DE   Table       201705        201703         0
1  AT   Table       201708        201707         0

Here is my solution. I'm sure some genius will come up with a better one~ let's wait~

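# String aggregation names avoid the missing numpy import; listing
# 'Created Week' first gives the column order shown below
df1 = df.groupby('ISO').agg({'Created Week': 'min', 'Billed Week': 'max'})
df1['ISO'] = df1.index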

     Created Week  Billed Week ISO
ISO                               
AT         201704       201708  AT
DE         201701       201705  DE

ISO = []
BilledWeek = []
CreateWeek = []
# df1 is indexed by ISO; columns are looked up by name rather than by
# position, because .ix has been removed from pandas
for iso, row in df1.iterrows():
    created, billed = int(row['Created Week']), int(row['Billed Week'])
    n_weeks = billed - created + 1
    BilledWeek.extend([billed] * n_weeks)
    CreateWeek.extend(range(created, billed + 1))
    ISO.extend([iso] * n_weeks)
DF = pd.DataFrame({'BilledWeek': BilledWeek, 'CreateWeek': CreateWeek, 'ISO': ISO})
# Left merge keeps every week of the full sequence; weeks without a bill get NaN
Target = DF.merge(df, left_on=['BilledWeek', 'CreateWeek', 'ISO'],
                  right_on=['Billed Week', 'Created Week', 'ISO'], how='left')
Target['Billings'] = Target['Billings'].fillna(0)
Target = Target.drop(['Billed Week', 'Created Week'], axis=1)
Target['Product'] = Target.groupby('ISO')['Product'].ffill()

Out[75]: 
   BilledWeek  CreateWeek ISO Product  Billings
0      201708      201704  AT   Table    1000.0
1      201708      201705  AT   Table    1000.0
2      201708      201706  AT   Table    1000.0
3      201708      201707  AT   Table       0.0
4      201708      201708  AT   Table    1000.0
5      201705      201701  DE   Table    1000.0
6      201705      201702  DE   Table    1000.0
7      201705      201703  DE   Table       0.0
8      201705      201704  DE   Table    1000.0
9      201705      201705  DE   Table    1000.0

Build a MultiIndex that fills in all the gaps in Created Week, then reindex the original DataFrame with it.

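# One (ISO, Product, Billed Week, Created Week) tuple for every week between
# the smallest and largest Created Week of each Billed Week group.
# Iterating over the groups yields a flat list, so no sum(idx, []) is needed.
idx = [(g['ISO'].min(), g['Product'].min(), g['Billed Week'].min(), e)
       for _, g in df.groupby('Billed Week')
       for e in range(g['Created Week'].min(), g['Created Week'].max() + 1)]

multi_idx = pd.MultiIndex.from_tuples(idx, names=['ISO', 'Product', 'Billed Week', 'Created Week'])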

(df.set_index(['ISO','Product','Billed Week','Created Week'])
     .reindex(multi_idx)
     .reset_index()
     .fillna(0)
)

Out[671]: 
  ISO Product  Billed Week  Created Week  Billings
0  DE   Table       201705        201701    1000.0
1  DE   Table       201705        201702    1000.0
2  DE   Table       201705        201703       0.0
3  DE   Table       201705        201704    1000.0
4  DE   Table       201705        201705    1000.0
5  AT   Table       201708        201704    1000.0
6  AT   Table       201708        201705    1000.0
7  AT   Table       201708        201706    1000.0
8  AT   Table       201708        201707       0.0
9  AT   Table       201708        201708    1000.0
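
If only the newly added rows are wanted (the two records listed in the question), one possible approach is an anti-join between the complete week grid and the original data. The following is a minimal sketch, not part of the answers on this page: the names `keys`, `full`, `merged` and `missing` are made up for illustration. As in the code above, the upper bound is each group's largest Created Week, which coincides with Billed Week in the sample data.

```python
import pandas as pd

df = pd.DataFrame(
    [['DE', 'Table', 201705, 201705, 1000], ['DE', 'Table', 201705, 201704, 1000],
     ['DE', 'Table', 201705, 201702, 1000], ['DE', 'Table', 201705, 201701, 1000],
     ['AT', 'Table', 201708, 201708, 1000], ['AT', 'Table', 201708, 201706, 1000],
     ['AT', 'Table', 201708, 201705, 1000], ['AT', 'Table', 201708, 201704, 1000]],
    columns=['ISO', 'Product', 'Billed Week', 'Created Week', 'Billings'])

keys = ['ISO', 'Product', 'Billed Week', 'Created Week']

# Full grid: every Created Week between each group's min and max.
# sort=False keeps the groups in order of first appearance (DE, then AT).
full = pd.DataFrame(
    [(iso, product, billed, week)
     for (iso, product, billed), g in df.groupby(keys[:3], sort=False)
     for week in range(g['Created Week'].min(), g['Created Week'].max() + 1)],
    columns=keys)

# Anti-join: keep the grid rows that have no match in the original data.
merged = full.merge(df, on=keys, how='left', indicator=True)
missing = (merged.loc[merged['_merge'] == 'left_only', keys]
           .assign(Billings=0)
           .reset_index(drop=True))
print(missing)
```

For the sample data, `missing` should contain the DE row for week 201703 and the AT row for week 201707, each with Billings set to 0.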

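Thanks @Wen, this looks four times better than the hack I had in mind. I'll have to check it later and will let you know whether it fits my needs on my larger DataFrame as well as on this test one.
OK, let me know if the code doesn't work for you~
I accidentally edited your answer, but I've reverted it. Sorry.
Allen, no problem!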

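def seqfix(x):
    s = x['Created Week']
    x = x.set_index('Created Week')
    # Insert a row for every week between the first and last Created Week
    x = x.reindex(range(min(s), max(s)+1))
    # Inserted weeks have no billings; the other columns repeat the previous row
    x['Billings'] = x['Billings'].fillna(0)
    x = x.ffill().reset_index()
    return x

# Apply seqfix to each group and stack the results; iterating over the groups
# keeps the 'ISO' and 'Billed Week' columns available inside seqfix
df = pd.concat([seqfix(g) for _, g in df.groupby(['ISO', 'Billed Week'])], ignore_index=True)
# Reindexing introduced NaN, which turned these columns into floats
df[['Billed Week', 'Billings']] = df[['Billed Week', 'Billings']].astype(int)
df = df[['ISO', 'Product', 'Billed Week', 'Created Week', 'Billings']]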

print(df)

  ISO Product  Billed Week  Created Week  Billings
0  AT   Table       201708        201704      1000
1  AT   Table       201708        201705      1000
2  AT   Table       201708        201706      1000
3  AT   Table       201708        201707         0
4  AT   Table       201708        201708      1000
5  DE   Table       201705        201701      1000
6  DE   Table       201705        201702      1000
7  DE   Table       201705        201703         0
8  DE   Table       201705        201704      1000
9  DE   Table       201705        201705      1000
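
One thing to watch with all of the approaches above: they build the week sequence with `range()` on the integer codes, which works for the sample data but would produce invalid codes (201753, 201754, ...) if a group spans a year boundary. Below is a minimal sketch of a replacement for `range()`, assuming the codes are ISO 8601 year/week numbers in YYYYWW form. That assumption is not stated in the question, and `iso_week_range` is a hypothetical helper name. It needs Python 3.8+ for `date.fromisocalendar`.

```python
from datetime import date, timedelta

def iso_week_range(start, end):
    """Return all YYYYWW codes from start to end inclusive (ISO 8601 weeks assumed)."""
    # Monday of the first and of the last week; int() also accepts numpy integers
    current = date.fromisocalendar(int(start) // 100, int(start) % 100, 1)
    last = date.fromisocalendar(int(end) // 100, int(end) % 100, 1)
    weeks = []
    while current <= last:
        iso_year, iso_week, _ = current.isocalendar()
        weeks.append(iso_year * 100 + iso_week)
        current += timedelta(weeks=1)  # step to the next Monday
    return weeks

print(iso_week_range(201750, 201803))
# [201750, 201751, 201752, 201801, 201802, 201803]
```

A call such as `iso_week_range(created_min, billed_max)` could then be used wherever the answers call `range(created_min, billed_max + 1)`.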