Python DataFrame.interpolate() extrapolates over trailing missing data
Consider the following example, in which we set up a sample dataset, create a MultiIndex, unstack the DataFrame, and then run a linear interpolation that fills row by row:
import pandas as pd  # version 0.14.1
import numpy as np   # version 1.8.1

df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   # list(...) so that the repetition also works on Python 3
                   'year': list(range(2000, 2005)) * 2,
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()
df = df.interpolate(method='linear', axis=1)
where the unstacked dataset looks like this:

                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN     1   NaN     3   NaN
oaks   a           NaN     5   NaN   NaN     2
Since this is an interpolation method, I would expect the output:

                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2
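But instead, the method yields (note the extrapolated value in 2004 for maples):

                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN     1     2     3     3
oaks   a           NaN     5     4     3     2

Is there a way to tell pandas not to extrapolate past the last non-missing value in a series?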
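EDIT:

I would still love to see this feature in pandas, but for now I have implemented it as a function in numpy and then use df.apply() to modify the df. What I was missing in pandas is the functionality of the left and right parameters of np.interp().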
import decimal

def interpolate(a, dec=None):
    """
    :param a: a 1d array (list, numpy array or pandas Series) to be interpolated
    :param dec: the number of decimal places with which each
                value should be returned
    :return: returns the interpolated integers or floats, in the same
             container type as the input
    """
    # default value is the largest number of decimal places in the input array
    if dec is None:
        dec = max_decimal(a)

    # always work on a float numpy array, whatever the input type
    b = np.asarray(a, dtype='float')

    # return the row if it's all nan's
    if np.all(np.isnan(b)):
        return a

    # interpolate; left/right=np.nan keeps positions outside the known range as nan
    x = np.arange(b.size)
    xp = np.where(~np.isnan(b))[0]
    fp = b[xp]
    interp = np.around(np.interp(x, xp, fp, left=np.nan, right=np.nan), decimals=dec)

    # return with proper numerical type formatting
    # check to make sure there aren't nan's before converting to int
    if dec == 0 and not np.isnan(interp).any():
        interp = interp.astype(int)

    # hand the result back in the container type that was passed in
    if isinstance(a, list):
        return interp.tolist()
    if isinstance(a, pd.Series):
        return pd.Series(interp, index=a.index)
    return interp

# two little helper functions
def count_decimal(i):
    # number of decimal places in i; nan and inf count as 0
    try:
        return int(decimal.Decimal(str(i)).as_tuple().exponent) * -1
    except ValueError:
        return 0

def max_decimal(a):
    # largest number of decimal places among the values of a
    m = 0
    for i in a:
        n = count_decimal(i)
        if n > m:
            m = n
    return m
Works like a charm on the example dataset:

In [1]: df.apply(interpolate, axis=1)
Out[1]:
                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2
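
For readers who only want the core idea, the left and right parameters of np.interp() can be seen in isolation. The following is a minimal sketch, not the function above: the sample row is made up, and the variable names are chosen here for illustration.

```python
import numpy as np

# Made-up sample row with a leading and a trailing gap.
row = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])

x = np.arange(row.size)               # positions to evaluate
xp = np.flatnonzero(~np.isnan(row))   # positions of the known values
fp = row[xp]                          # the known values themselves

# left/right set what np.interp returns for x outside [xp[0], xp[-1]].
# Without them it would repeat fp[0] and fp[-1], i.e. extrapolate flat.
filled = np.interp(x, xp, fp, left=np.nan, right=np.nan)
print(filled)
```

Only the gap between the two known values is filled; the positions before the first and after the last known value stay NaN.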
This is indeed perplexing behavior. Here is a more compact solution that can be applied after the initial interpolation:
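def de_extrapolate(row):
    # entries equal to the last value; more than one suggests a forward-filled tail
    # (caveat: this also triggers when the last real value legitimately repeats)
    extrap = row[row == row.iloc[-1]]
    if extrap.size > 1:
        # keep the first occurrence, blank out everything from the second onward
        first_index = extrap.index[1]
        row.loc[first_index:] = np.nan
    return row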
As before, we get:

In [1]: df.interpolate(axis=1).apply(de_extrapolate, axis=1)
Out[1]:
                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN     1     2     3   NaN
oaks   a           NaN     5     4     3     2
Replace the following line:

df = df.interpolate(method='linear', axis=1)

with this:

df = df.interpolate(axis=1).where(df.bfill(axis=1).notnull())
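It uses a backfill to build a mask for the trailing NaNs. Because it performs two NaN-filling operations it is not very efficient, but that is usually not a problem.

As of pandas version 0.21.0, limit_area='inside' tells df.interpolate to fill only NaNs that are surrounded by valid values: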
import pandas as pd  # version 0.21.0
import numpy as np

df = pd.DataFrame({'location': ['a', 'b'] * 5,
                   'trees': ['oaks', 'maples'] * 5,
                   'year': list(range(2000, 2005)) * 2,
                   'value': [np.nan, 1, np.nan, 3, 2, np.nan, 5, np.nan, np.nan, np.nan]})
df.set_index(['trees', 'location', 'year'], inplace=True)
df = df.unstack()
df2 = df.interpolate(method='linear', axis=1, limit_area='inside')
print(df2)
yields

                 value
year              2000  2001  2002  2003  2004
trees  location
maples b           NaN   1.0   2.0   3.0   NaN
oaks   a           NaN   5.0   4.0   3.0   2.0
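
As a quick sanity check, the backfill-mask approach and limit_area='inside' can be compared side by side. This is only a sketch under stated assumptions: the toy frame below is made up for illustration (plain integer column labels instead of the MultiIndex above), and it requires pandas 0.21.0 or later for limit_area.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the two rows of the example data.
toy = pd.DataFrame(
    [[np.nan, 1.0, np.nan, 3.0, np.nan],
     [np.nan, 5.0, np.nan, np.nan, 2.0]],
    columns=[2000, 2001, 2002, 2003, 2004],
)

# Approach 1: interpolate, then keep only cells that have a valid value
# at or to the right of them in the original data (the backfill mask).
masked = toy.interpolate(method='linear', axis=1).where(toy.bfill(axis=1).notnull())

# Approach 2: let interpolate fill only NaNs lying between valid values.
inside = toy.interpolate(method='linear', axis=1, limit_area='inside')

# DataFrame.equals treats NaNs in the same positions as equal.
print(masked.equals(inside))
```

On this toy data both approaches should leave the trailing NaN in the first row untouched while filling the interior gaps.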
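Comments:

I find this strange; maybe it is a bug? Worth reporting!

Yes, possibly an edge case. Please file an issue.

Filed.

Providing an answer inside the question is not ideal: it prevents voting and muddles the question. How about posting it as a separate answer?

This should be the accepted answer, since pandas 0.21.0 is now widespread and this is exactly what was asked for.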