Is there a simple way to expand/complete a pandas DataFrame to include missing observations across multiple columns?

python pandas

I have a DataFrame that looks like this:

>>> df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4]
})

>>> df
  category1 category2  year  value
0         A         x  2000      0
1         A         y  2000      1
2         B         x  2000      0
3         A         x  2002      4
4         A         y  2002      3
5         B         x  2002      4

I want to expand the data to include the missing years. For example, if the range is range(2000, 2003), the expanded DataFrame should look like this:

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0

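I tried an approach using pd.MultiIndex.from_product, but it creates rows that are not valid combinations of category1 and category2 (for example, B and y never occur together in the data, so they should not appear together in the result). Using from_product and then filtering is too slow on my actual data, which contains many more combinations.

Is there a simpler solution that scales well?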
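
To make the problem concrete, here is a minimal sketch of the from_product-then-filter approach described above. It is a reconstruction under assumptions, not the asker's actual code: the names full_index, product, valid and filtered are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4]
})

id_cols = ['category1', 'category2']
years = range(2000, 2003)

# Cartesian product of every category1, every category2 and every year.
full_index = pd.MultiIndex.from_product(
    [df['category1'].unique(), df['category2'].unique(), years],
    names=id_cols + ['year'],
)

# Reindexing against the full product also creates the (B, y) rows,
# a pair that never occurs in the original data.
product = df.set_index(id_cols + ['year']).reindex(full_index).reset_index()

# Filtering step: keep only the pairs that were actually observed.
valid = df[id_cols].drop_duplicates()
filtered = product.merge(valid, on=id_cols, how='inner')
```

With many categories, the intermediate product can be far larger than the filtered result, which is where the slowdown comes from.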

Edit
This is the solution I ended up using, written to generalize the problem a little:
id_cols = ['category1', 'category2']

df_out = (df.pivot_table(index=id_cols, values='value', columns='year')
            .reindex(columns=range(2000, 2003))
            .stack(dropna=False)
            .sort_index(level=-1)
            .reset_index(name='value'))

  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0

Let's use stack and unstack:

dfout=df.set_index(['year','category1','category2']).\
         value.unstack(level=0).\
         reindex(columns=range(2000,2003)).\
         stack(dropna=False).to_frame('value').\
         sort_index(level=2).reset_index()
  category1 category2  year  value
0         A         x  2000    0.0
1         A         y  2000    1.0
2         B         x  2000    0.0
3         A         x  2001    NaN
4         A         y  2001    NaN
5         B         x  2001    NaN
6         A         x  2002    4.0
7         A         y  2002    3.0
8         B         x  2002    4.0
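
One difference between the two reshaping approaches above is worth checking before swapping one for the other: pivot_table aggregates duplicate keys (its default aggfunc is the mean), whereas unstack refuses to reshape them. The sketch below uses a small made-up frame, dup, that is not part of the question's data.

```python
import pandas as pd

# Hypothetical frame with two rows for the same (A, x, 2000) key.
dup = pd.DataFrame({
    'category1': ['A', 'A'],
    'category2': ['x', 'x'],
    'year': [2000, 2000],
    'value': [0, 2],
})
id_cols = ['category1', 'category2']

# pivot_table silently averages the duplicated key.
wide = dup.pivot_table(index=id_cols, values='value', columns='year')
mean_value = wide.loc[('A', 'x'), 2000]

# unstack cannot reshape an index with duplicate entries.
try:
    dup.set_index(['year'] + id_cols)['value'].unstack(level=0)
    unstack_failed = False
except ValueError:
    unstack_failed = True
```

If the keys are unique, as in the question's example, both approaches produce the same wide table.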

Or:
fake = df.drop_duplicates(['category1','category2']).filter(['category1','category2'])

fake.index = [2001]*len(fake)
#merge two indexes on year    
pd.concat((df.set_index('year'),fake)).sort_index()
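
The snippet above hard-codes the single missing year 2001. A possible generalization, sketched here under the assumption that a missing year is missing for every category pair (as in the example data), builds one filler block per absent year. The names pairs, missing, fillers and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'category1': list('AABAAB'),
    'category2': list('xyxxyx'),
    'year': [2000, 2000, 2000, 2002, 2002, 2002],
    'value': [0, 1, 0, 4, 3, 4]
})

id_cols = ['category1', 'category2']

# The category pairs that actually occur in the data.
pairs = df[id_cols].drop_duplicates()

# Years in the target range that are entirely absent from the data.
missing = sorted(set(range(2000, 2003)) - set(df['year']))

# One block of pairs per missing year; 'value' is absent, so it becomes NaN.
fillers = [pairs.assign(year=y) for y in missing]

out = (pd.concat([df, *fillers], ignore_index=True)
         .sort_values(['year'] + id_cols)
         .reset_index(drop=True))
```

Years that are only partially present are left untouched by this sketch, so it is not a full replacement for the reindex-based answers.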

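Update 2021/01/08:
You can use a dedicated function to abstract this process; at the time of this update, you had to install the latest development version of the package that provides it.
The function works by taking a list of the columns whose missing values should be completed. The idea is borrowed from an existing function. Because this question needs new values for the year column, you can pass a callable through a dictionary, and the function will use the new values.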
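With respect to existing data, the verb for filling in missing observations is "impute". With a simple NaN, you are facing a combinatorial problem. Please post your attempt, as expected. Show where the intermediate results deviate from the ones you expect.
Can we replace df.set_index(['year', 'category1', 'category2']).value.unstack(level=0) with df.pivot_table(index=['category1', 'category2'], values=['value'], columns=['year'])?
@Ch3steRC Yes, thanks for the reply. Is set_index([...]).value.unstack faster than df.pivot_table(...)?
@Ch3steRC For a large df, pivot_table should be slightly faster than unstack; it is the same way of reshaping the df.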