Python: DataFrame.interpolate() extrapolates over trailing missing data


python pandas


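Consider the following example, in which we set up a sample dataset, create a MultiIndex, unstack the DataFrame, and then run a linear interpolation that fills row by row: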

import pandas as pd  # version 0.14.1
import numpy as np  # version 1.8.1

# list(range(...)) so the repetition also works where range() is not a list
df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   'year': list(range(2000, 2005)) * 2,
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()  # years become columns
df = df.interpolate(method='linear', axis=1)  # fill along each row

The unstacked dataset looks like this:

                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1   NaN     3   NaN
oaks   a           NaN     5   NaN   NaN     2

For an interpolation method, I would expect this output:

                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2

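But instead, the method produces (note the extrapolated value in the last column of the first row):

                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3     3
oaks   a           NaN     5     4     3     2

Is there a way to tell pandas not to extrapolate past the last non-missing value in a series?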
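The behaviour can be reproduced on a single row. The sketch below is a minimal illustration, assuming a plain Series with made-up values that mirror the first row of the example; the names `s` and `filled` are only for illustration.

```python
import numpy as np
import pandas as pd

# One row of the example as a plain Series (illustrative values only).
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

filled = s.interpolate(method='linear')

# The interior gap is interpolated. The trailing NaN is padded with the
# last valid value, while the leading NaN is left untouched because the
# default fill direction is forward.
print(filled.tolist())
```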

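Edit:

I would still like to see this functionality in pandas, but for now I have implemented it as a function in numpy and then use df.apply() to modify df. What I was missing in pandas is the behaviour of the left and right parameters of np.interp.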
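To see what `left` and `right` do in isolation, here is a small sketch on a bare array. The sample values and the names `clamped` and `masked` are assumptions made for illustration; only `np.interp` and its `left`/`right` keywords come from numpy itself.

```python
import numpy as np

b = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])

x = np.arange(b.size)                # positions to evaluate
xp = np.flatnonzero(~np.isnan(b))    # positions that hold valid data
fp = b[xp]                           # the valid values themselves

# Default: positions outside [xp[0], xp[-1]] repeat the edge values.
clamped = np.interp(x, xp, fp)

# With left/right set to NaN, positions outside the valid range stay missing.
masked = np.interp(x, xp, fp, left=np.nan, right=np.nan)

print(clamped)
print(masked)
```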

import decimal  # used by count_decimal below


def interpolate(a, dec=None):
    """
    :param a: a 1d array to be interpolated
    :param dec: the number of decimal places with which each
                value should be returned
    :return: returns an array of integers or floats
    """

    # default value is the largest number of decimal places in the input array
    if dec is None:
        dec = max_decimal(a)

    # work on a float ndarray whatever container was passed in
    b = np.asarray(a, dtype='float')

    # return the row if it's all nan's
    if np.all(np.isnan(b)):
        return a

    # interpolate; left/right=nan keeps values outside the valid range missing
    x = np.arange(b.size)
    xp = np.where(~np.isnan(b))[0]
    fp = b[xp]
    interp = np.around(np.interp(x, xp, fp, left=np.nan, right=np.nan), decimals=dec)

    # return with proper numerical type formatting
    # check to make sure there aren't nan's before converting to int
    if dec == 0 and not np.isnan(interp).any():
        interp = interp.astype(int)
    # hand back the same kind of container that came in
    if isinstance(a, list):
        return interp.tolist()
    if isinstance(a, pd.Series):
        return pd.Series(interp, index=a.index)
    return interp


# two little helper functions
def count_decimal(i):
    try:
        return int(decimal.Decimal(str(i)).as_tuple().exponent) * -1
    except ValueError:
        return 0


def max_decimal(a):
    m = 0
    for i in a:
        n = count_decimal(i)
        if n > m:
            m = n
    return m

Works like a charm on the example dataset:

In[1]: df.apply(interpolate, axis=1)
Out[1]:
                 value                        
year              2000  2001  2002  2003  2004
trees  location                               
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2

This is indeed puzzling behaviour. Here is a more compact solution that can be applied after the initial interpolation:

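def de_extrapolate(row):
    row = row.copy()  # do not mutate the data handed over by apply()
    # interpolate() pads trailing NaNs with the last valid value, so look
    # for repeats of the final value (this assumes that value does not
    # also occur earlier in the row)
    hits = np.flatnonzero((row == row.iloc[-1]).to_numpy())
    if hits.size > 1:
        # blank everything from the second occurrence onward
        row.iloc[hits[1]:] = np.nan
    return row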

As before, we get:

In [1]: df.interpolate(axis=1).apply(de_extrapolate, axis=1)
Out[1]: 
                value                    
year             2000 2001 2002 2003 2004
trees  location                          
maples b          NaN    1    2    3  NaN
oaks   a          NaN    5    4    3    2

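Replace the following line:

df = df.interpolate(method='linear', axis=1)

with this:

df = df.interpolate(axis=1).where(df.bfill(axis=1).notnull())

It builds a mask for the trailing NaNs by using a backfill: after a backfill, the only cells that are still NaN are those with no valid value after them. It is not very efficient because it performs two NaN-filling operations, but that is usually not a concern.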
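The masking step can be spelled out on a small frame. This is only a sketch under assumed data: the two rows mirror the example values, and the names `filled`, `mask` and `result` are illustrative, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, 1.0, np.nan, 3.0, np.nan],
     [np.nan, 5.0, np.nan, np.nan, 2.0]],
    columns=range(2000, 2005),
)

# Step 1: ordinary row-wise interpolation (this pads the trailing NaN).
filled = df.interpolate(axis=1)

# Step 2: after a row-wise backfill, a cell is still NaN only if no valid
# value follows it, so notna() is False exactly on the trailing gaps.
mask = df.bfill(axis=1).notna()

# Step 3: keep interpolated values where the mask is True, NaN elsewhere.
result = filled.where(mask)
print(result)
```

Leading NaNs pass through unchanged: the mask is True there, but the interpolation never filled them in the first place.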

Starting from pandas version 0.21.0, limit_area='inside' tells df.interpolate to fill only NaNs surrounded by valid values:

import pandas as pd  # version 0.21.0
import numpy as np

df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   'year': list(range(2000, 2005)) * 2,
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()

# limit_area='inside' restricts filling to gaps bounded by valid values
df2 = df.interpolate(method='linear', axis=1, limit_area='inside')
print(df2)

yields

                value                    
year             2000 2001 2002 2003 2004
trees  location                          
maples b          NaN  1.0  2.0  3.0  NaN
oaks   a          NaN  5.0  4.0  3.0  2.0
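As a further illustration, limit_area='inside' also protects leading gaps when filling in both directions. The sketch below assumes a single Series with made-up values; `both` and `inside` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Filling in both directions pads the leading and the trailing NaN.
both = s.interpolate(limit_direction='both')

# Restricting to the inside leaves both edges missing and fills only
# the gap that lies between two valid values.
inside = s.interpolate(limit_direction='both', limit_area='inside')

print(both.tolist())
print(inside.tolist())
```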

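Comments:

I find this very strange, maybe a bug? Worth reporting!

Yes, it may be an edge case. Please file an issue.

Filed.

Putting the answer inside the question is not the best approach: it prevents voting and muddies the question. How about writing it up as a separate answer?

This should be the accepted answer, since pandas 0.21.0 is now widespread and it does exactly what was asked for.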
