Python pandas: impute a given number of missing values before/after a run of available values

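Suppose I have a time series in which I usually have data for several consecutive years, but values are missing before and after that period, like this: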

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"],
    'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan],
})
print(df)

   year  cakes eaten
0  2000          NaN
1  2001          NaN
2  2002          NaN
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          NaN
7  2007          NaN

Is there a way to fill in (a given number of) missing values based on the trend in the available values?

Say I want to fill at most 2 values in each direction; the result should then look like this:

   year  cakes eaten
0  2000          NaN
1  2001          1.0
2  2002          2.0
3  2003          3.0
4  2004          4.0
5  2005          5.0
6  2006          6.0
7  2007          7.0

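Additionally: is there a way to make sure this imputation is only performed when enough values are available? For example, I only want to fill at most 2 values in each direction if at least 3 values are available (or, more generally, fill n values only if n+m are available).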

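I would use interpolate(), as mentioned above. Different methods produce different results. With the krogh method I got a linear trend line. limit_direction='both' is needed to fill in the trend in both directions: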

import numpy as np
import pandas as pd

test_dict = {'col': [np.nan, np.nan, np.nan, np.nan, np.nan, 4, 5, 6, np.nan]}
df = pd.DataFrame(test_dict)
# method='krogh' requires SciPy to be installed
df['trend'] = df['col'].interpolate(method='krogh', limit_direction='both')

    col trend
0   NaN -1.0
1   NaN 0.0
2   NaN 1.0
3   NaN 2.0
4   NaN 3.0
5   4.0 4.0
6   5.0 5.0
7   6.0 6.0
8   NaN 7.0

Once that is done, you can remove any unwanted trend values below 0.
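
One way to drop the negative trend values is to mask them with Series.where, which replaces every value that fails the condition with NaN. This is only a minimal sketch: the trend values are typed in by hand from the output above, and the name cleaned is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Trend values copied from the output above (hand-typed, not recomputed).
trend = pd.Series([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], name="trend")

# Keep values that are >= 0; everything below 0 becomes NaN.
cleaned = trend.where(trend >= 0)
print(cleaned)
```

Whether 0 itself should be kept depends on the data; change the condition to trend > 0 if it should not.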

Thanks to @olv1do for pointing me to the function I was looking for.

Using interpolate together with .first_valid_index and .last_valid_index gives the desired behavior:

# Impute n values in both directions if at least m values are available
def interpolate(data, n, m):
    first_valid = data['cakes eaten'].first_valid_index()
    last_valid = data['cakes eaten'].last_valid_index()

    # Number of rows between the first and last valid value, inclusive
    if abs(first_valid - last_valid) + 1 >= m:
        # method='spline' requires SciPy; limit caps the number of filled values
        data['imputed'] = data['cakes eaten'].interpolate(
            method='spline', order=1, limit_direction='both', limit=n)
    return data

For the example in the question:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan]})
interpolate(df, 2,3)

year    cakes eaten     imputed
0   2000    NaN     NaN
1   2001    NaN     1.0
2   2002    NaN     2.0
3   2003    3.0     3.0
4   2004    4.0     4.0
5   2005    5.0     5.0
6   2006    NaN     6.0
7   2007    NaN     7.0

If fewer than m values are available, nothing is done:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, np.nan, 3, 4,  np.nan, np.nan, np.nan]})
interpolate(df, 2,3)

    year    cakes eaten
0   2000    NaN
1   2001    NaN
2   2002    NaN
3   2003    3.0
4   2004    4.0
5   2005    NaN
6   2006    NaN
7   2007    NaN
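
Note that the check in the function measures the distance between the first and the last valid index, which equals the number of available values only when there are no gaps inside that range. A small sketch of the difference, using a made-up series with an interior gap (the data and the names span and available are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values between two observations.
cakes = pd.Series([np.nan, 3, np.nan, np.nan, 5, np.nan])

# Rows between the first and last valid value, inclusive.
span = cakes.last_valid_index() - cakes.first_valid_index() + 1

# Number of non-missing values.
available = cakes.count()

print(span, available)
```

If interior gaps can occur, comparing cakes.count() against m may be closer to the "at least m values available" requirement.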

Moreover, the spline method also works well when the values are not perfectly linear as in my example:

df = pd.DataFrame({'year': ["2000","2001","2002", "2003","2004", "2005","2006", "2007"], 'cakes eaten': [np.nan, np.nan, 1, 4, 2,  3, np.nan, np.nan]})
interpolate(df, 1,4)

    year    cakes eaten     imputed
0   2000    NaN     NaN
1   2001    NaN     1.381040
2   2002    1.0     1.000000
3   2003    4.0     4.000000
4   2004    2.0     2.000000
5   2005    3.0     3.000000
6   2006    NaN     3.433167
7   2007    NaN     NaN
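
For comparison, here is a sketch of the same idea without SciPy. It is not the spline approach used above: it fits a straight line to the available values with np.polyfit and uses that line to fill at most n missing values on each side of the observed block, and only when at least m values are available. The function name fill_edges_linear and its parameters are assumptions made for this illustration, and it deliberately leaves gaps inside the observed block untouched.

```python
import numpy as np
import pandas as pd


def fill_edges_linear(series, n, m):
    """Fill up to n NaNs before and after the observed block with a fitted line.

    Nothing is filled unless at least m non-missing values exist.
    """
    result = series.astype(float).copy()
    valid = result.notna().to_numpy()
    if valid.sum() < max(m, 2):  # a line needs at least two points
        return result

    positions = np.arange(len(result))
    # Least-squares line through the available values only.
    slope, intercept = np.polyfit(positions[valid], result.to_numpy()[valid], deg=1)

    first = positions[valid][0]
    last = positions[valid][-1]
    # Positions at most n steps outside the observed block.
    before = positions[(positions >= first - n) & (positions < first)]
    after = positions[(positions > last) & (positions <= last + n)]
    for pos in np.concatenate([before, after]):
        result.iloc[pos] = slope * pos + intercept
    return result


cakes = pd.Series([np.nan, np.nan, np.nan, 3, 4, 5, np.nan, np.nan])
filled = fill_edges_linear(cakes, n=2, m=3)
print(filled)
```

A straight-line fit is a simplification; for data with a curved trend, a higher polynomial degree or the spline method shown earlier may be more appropriate.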

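How would you identify the trend in the available values?

Thanks for pointing me back to the interpolate function; it does seem to be able to do what I want. For the example I posted above, krogh works very well, but it produces some very strange values if the trend is not perfectly linear. However, I found that the spline method with order=2 works better.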