Python pandas: impute a given number of missing values before/after a run of available values
Suppose I have a time series where I usually have data for several consecutive years, but values are missing before and after that period, like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
print(df)
   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          NaN
7  2007          NaN
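Is there a way to fill in (a given number of) missing values based on the trend in the available values?
Suppose I want to fill at most 2 values in each direction; the result should look like this: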
   year  cakes eaten
0  2000          NaN
1  2001          1.0
2  2002          2.0
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          6.0
7  2007          7.0
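Also: is there a way to make sure this imputation only happens when enough values are available? For example, I only want to fill at most 2 values in each direction if at least 3 values are available (or, more generally, fill n values only if n+m are available).

I would use interpolate(), as mentioned above. Different methods give different results. I used the krogh method to get a linear trend line. limit_direction='both' is needed to fill the trend in both directions: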
test_dict = {'col': [np.nan, np.nan,np.nan, np.nan, np.nan, 4, 5, 6 ,np.nan]}
df = pd.DataFrame(test_dict)
df['trend'] = df['col'].interpolate(method='krogh', limit_direction='both')
   col  trend
0  NaN   -1.0
1  NaN    0.0
2  NaN    1.0
3  NaN    2.0
4  NaN    3.0
5  4.0    4.0
6  5.0    5.0
7  6.0    6.0
8  NaN    7.0
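Two details of this answer can be checked without SciPy. The sketch below first shows that the default method='linear' repeats the nearest observed value outside the observed range instead of continuing the trend, which is why a method such as krogh is used here. It then masks negative trend values with Series.where. The trend column is hard-coded from the output above rather than recomputed, and the names flat and trend_non_negative are only illustrative.

```python
import numpy as np
import pandas as pd

col = pd.Series([np.nan, np.nan, np.nan, np.nan, np.nan, 4, 5, 6, np.nan])

# Default method='linear': NaNs outside the observed range are filled with the
# nearest observed value, so no trend is continued.
flat = col.interpolate(limit_direction='both')
print(flat.tolist())  # [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 6.0, 6.0]

# Trend column copied from the krogh output above (hard-coded, not recomputed).
trend = pd.Series([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Series.where keeps values where the condition holds and puts NaN elsewhere,
# so every trend value below 0 becomes missing again.
trend_non_negative = trend.where(trend >= 0)
print(trend_non_negative.tolist())
```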
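Once that is done, you can remove any unwanted trend values below 0.

Thanks to @olv1do for pointing me to the function I needed. Using interpolate together with .first_valid_index and .last_valid_index gives the desired behavior: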
# impute up to n values in both directions if at least m values are available
def interpolate(data, n, m):
    first_valid = data['cakes eaten'].first_valid_index()
    last_valid = data['cakes eaten'].last_valid_index()
    # the stretch from the first to the last valid row must cover at least m rows
    if abs(first_valid - last_valid) + 1 >= m:
        data['imputed'] = data['cakes eaten'].interpolate(method='spline', order=1, limit_direction='both', limit=n)
    return data
Applied to the example from the question:
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
interpolate(df, 2,3)
   year  cakes eaten  imputed
0  2000          NaN      NaN
1  2001          NaN      1.0
2  2002          NaN      2.0
3  2003          3.0      3.0
4  2004          4.0      4.0
5  2005          5.0      5.0
6  2006          NaN      6.0
7  2007          NaN      7.0
If fewer than m values are available, nothing is done:
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, np.nan, np.nan, np.nan]})
interpolate(df, 2,3)
   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          NaN
6  2006          NaN
7  2007          NaN
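One caveat about the availability check: it compares the distance between the first and the last valid row with m, which equals the number of available values only when there are no gaps in between. A minimal sketch of the difference, assuming a default integer index as in the examples here; valid_span, valid_count and gappy are hypothetical names, not pandas functions:

```python
import numpy as np
import pandas as pd


def valid_span(series):
    """Number of rows from the first to the last non-missing value.

    Assumes a default integer index, so positions can be subtracted.
    """
    first = series.first_valid_index()
    last = series.last_valid_index()
    if first is None:  # the series holds no valid values at all
        return 0
    return last - first + 1


def valid_count(series):
    """Number of non-missing values, regardless of where they sit."""
    return int(series.notna().sum())


# Two observations separated by a two-row gap.
gappy = pd.Series([np.nan, 3, np.nan, np.nan, 5, np.nan])
print(valid_span(gappy))   # 4
print(valid_count(gappy))  # 2
```

If the requirement is "at least m observed values", comparing valid_count with m may be the safer check. The explicit None test also avoids a TypeError when a column is entirely empty.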
The spline method also works well when the values are not perfectly linear, unlike in my example:
df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, 1, 4, 2, 3, np.nan, np.nan]})
interpolate(df, 1,4)
   year  cakes eaten   imputed
0  2000          NaN       NaN
1  2001          NaN  1.381040
2  2002          1.0  1.000000
3  2003          4.0  4.000000
4  2004          2.0  2.000000
5  2005          3.0  3.000000
6  2006          NaN  3.433167
7  2007          NaN       NaN
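How would you identify the trend in the available values?

Thanks for pointing me back to the interpolate function; it does seem able to do what I want. For the example I posted above, krogh works very well, but it produces some very strange values when the trend is not perfectly linear. However, I found that the spline method with order=2 works better.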
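One possible answer to the question of how to identify the trend is to fit a straight line to the available values by least squares and evaluate it at the neighbouring rows. The sketch below does this with numpy.polyfit instead of the spline or krogh interpolation used in the answers, so it is a different technique and needs no SciPy. The function name fill_edges_with_linear_trend and its parameters are hypothetical, and it only fills rows before and after the observed block, leaving gaps inside the block untouched.

```python
import numpy as np
import pandas as pd


def fill_edges_with_linear_trend(values, n, min_valid):
    """Fill up to n NaNs on each side of the observed block with a fitted line.

    Hypothetical helper, not part of pandas. Returns a new Series and leaves
    the data unchanged when fewer than min_valid values are available.
    """
    s = pd.Series(values, dtype="float64").reset_index(drop=True)
    valid = s.dropna()
    # A line needs at least two points, whatever min_valid says.
    if len(valid) < max(min_valid, 2):
        return s
    # Least-squares line through (row position, value) of the observed points.
    x = valid.index.to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, valid.to_numpy(), deg=1)
    first, last = int(valid.index[0]), int(valid.index[-1])
    out = s.copy()
    # Row positions of up to n rows before and after the observed block.
    before = range(max(first - n, 0), first)
    after = range(last + 1, min(last + n, len(s) - 1) + 1)
    for pos in list(before) + list(after):
        out.iloc[pos] = slope * pos + intercept
    return out


cakes = [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]
filled = fill_edges_with_linear_trend(cakes, n=2, min_valid=3)
print(filled)
```

For the data from the question this should give approximately 1 and 2 before the observed block and 6 and 7 after it, with row 0 left as NaN. A fitted line smooths over noise in the observations, whereas an interpolating method passes through every point, which may explain the odd values reported for krogh on non-linear data.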