Python NumPy weighted average with NaN
First things first: this is not a duplicate, and I will explain why. Suppose I have an array:
a = array([1,2,3,4])
and I want to average it using weights:
weights = [4,3,2,1]
output = average(a, weights=weights)
print(output)
2.0
OK, so that is straightforward. But now I have something like this:
a = array([1,2,nan,4])
Calculating the average the usual way of course yields nan. Can I avoid this?
In principle I want to ignore the NaNs, so I would like to end up with something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print(output)
1.75
First find the indices where the items are not nan, then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in the comments, it is cleaner to use a boolean mask array here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
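The boolean-mask filtering above can be wrapped in a small helper. This is only a sketch: the function name nan_weighted_average is made up for illustration, and returning nan when every value is NaN is an assumption about the desired behaviour, not something the answer specifies.

```python
import numpy as np

def nan_weighted_average(values, weights):
    """Weighted average of 1-D `values`, skipping entries where `values` is NaN.

    Hypothetical helper (name and all-NaN behaviour are assumptions).
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(values)          # True where the value is usable
    if not keep.any():
        return np.nan                 # nothing left to average
    # Drop the NaN entries and their weights together so the two stay aligned.
    return np.average(values[keep], weights=weights[keep])

result = nan_weighted_average([1, 2, np.nan, 4], [4, 3, 2, 1])
print(result)  # (1*4 + 2*3 + 4*1) / (4 + 3 + 1) = 1.75
```

Note that np.average still raises ZeroDivisionError if the remaining weights sum to zero, so that case would need its own handling.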
Alternatively, you can use a MaskedArray:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
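The MaskedArray approach also carries over to 2D input with the axis argument, even when the NaNs sit in different rows of each column. The array and weights below are made-up illustration data, not taken from the answer.

```python
import numpy as np

# Made-up 2D data: each column has its NaN in a different row.
a = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])
weights = np.array([4.0, 3.0, 2.0])   # one weight per row (the axis=0 direction)

ma = np.ma.MaskedArray(a, mask=np.isnan(a))
col_avg = np.ma.average(ma, weights=weights, axis=0)

# Column 0: (1*4 + 2*3) / (4 + 3) = 10/7
# Column 1: (5*3 + 7*2) / (3 + 2) = 5.8
col_avg_plain = col_avg.filled(np.nan)   # back to a plain ndarray
print(col_avg_plain)
```

A column that is entirely NaN comes back masked, so filled(np.nan) turns it into nan rather than a number.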
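I will offer another solution, which scales better to higher dimensions (for example, averaging over different axes). The code below works with a 2D array that may contain NaNs and takes the average along axis=0: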
import numpy as np

# generate a random 2D array; cast to float and plant one NaN so there is something to ignore
a = np.random.randint(5, size=(3,2)).astype(float)
a[0, 1] = np.nan
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
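The same zero-weight idea can be written as a reusable function. This is a sketch under stated assumptions: the name nan_weighted_mean is hypothetical, weights is assumed to broadcast against a (for example a column vector of shape (n, 1) for axis=0), and returning nan for an all-NaN slice is a choice made here, not part of the answer.

```python
import numpy as np

def nan_weighted_mean(a, weights, axis=0):
    """Weighted mean along `axis`, giving NaN entries zero weight.

    Hypothetical helper; `weights` must broadcast against `a`.
    """
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(a)
    w = np.where(valid, weights, 0.0)   # zero weight wherever a is NaN
    x = np.where(valid, a, 0.0)         # replace NaN so it cannot poison the sum
    num = (w * x).sum(axis=axis)
    den = w.sum(axis=axis)
    # 0/0 for an all-NaN slice yields nan; silence the warning it would emit.
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den

a = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [np.nan, 7.0]])
w_col = np.array([[4.0], [3.0], [2.0]])   # one weight per row
mean_vec = nan_weighted_mean(a, w_col, axis=0)
print(mean_vec)  # column 0: 10/7, column 1: 29/5
```

Replacing NaN by 0 is safe here only because the matching weight is also set to 0, so the entry drops out of both the numerator and the denominator.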
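Extending the answers by @Ashwini and @Nicolas, here is a version that also handles the edge case where all data values are np.nan, and that works on a pandas DataFrame without type-related issues: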
from typing import List, Union

import numpy as np
import pandas as pd


def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
        calc_wa_ignore_nan approach:
            (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
        The ignoring nans approach:
            (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
        ...                    "meas2": [2, 2],
        ...                    "meas3": [3, 3],
        ...                    "meas4": [np.nan, 0],
        ...                    "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])
    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)
    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)
    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)
    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)
    return weighted_avgs
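As a cross-check of the arithmetic in the docstring, the same row-wise renormalised average can be sketched with pandas operations alone. This is an alternative formulation, not the answer's method; the variable names are illustrative and the data is the docstring example restricted to the three selected columns.

```python
import numpy as np
import pandas as pd

# The three selected columns from the docstring example.
df = pd.DataFrame({"meas2": [2, 2],
                   "meas3": [3, 3],
                   "meas4": [np.nan, 0]})
measures = ["meas2", "meas3", "meas4"]
weights = pd.Series([0.5, 0.2, 0.3], index=measures)

values = df[measures].astype(float)
valid = values.notna()

# Numerator: sum of weight * value over the non-NaN cells of each row.
num = values.fillna(0.0).mul(weights, axis="columns").sum(axis=1)
# Denominator: sum of the weights that belong to non-NaN cells of each row.
den = valid.astype(float).mul(weights, axis="columns").sum(axis=1)

row_avg = num / den
print(row_avg.to_numpy())  # row 0: 1.6 / 0.7, row 1: 1.6 / 1.0
```

The first row renormalises over the weights 0.5 and 0.2 only, which is why it differs from the plain 1.6 of the second row.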
All of the solutions above are very good, but none of them handles the case where the weights themselves contain nan. For that, use pandas:
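import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
    # Accumulate sum(weight * value) and sum(weight) over the rows
    # where neither the value nor the weight is NaN.
    den = 0
    num = 0
    for index, row in df.iterrows():
        if not np.isnan(row[col_weight]) and not np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den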
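The row loop above can also be expressed without iterrows. The sketch below is a vectorised variant under assumptions: the function name, the column names "value" and "weight", the sample data, and the choice to return nan when no usable weight remains are all made up for illustration.

```python
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan_vectorized(df, col_value, col_weight):
    """Hypothetical vectorised variant: skip rows where value or weight is NaN."""
    values = df[col_value].to_numpy(dtype=float)
    weights = df[col_weight].to_numpy(dtype=float)
    valid = ~np.isnan(values) & ~np.isnan(weights)
    den = weights[valid].sum()
    if den == 0:
        return np.nan   # assumption: no usable weight means no defined average
    return (values[valid] * weights[valid]).sum() / den

# Made-up data with a NaN in each column.
df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0],
                   "weight": [4.0, np.nan, 2.0, 1.0]})
result = weighted_average_ignoring_nan_vectorized(df, "value", "weight")
print(result)  # rows 0 and 3 survive: (1*4 + 4*1) / (4 + 1) = 1.6
```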
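+1, though I think indices = ~np.isnan(a) looks nicer (and might be faster for large a).
@mtrw That definitely looks better, I will update my answer. Thanks.
Another option is to use np.nan_to_num(arr) before doing the average. This replaces any NaN with 0.
@TirthaR The zeros obtained this way would distort the result.
One problem here is that this approach changes the size of the array, so any later operation that depends on the size has to be corrected accordingly. In that case it is better to use the masking approach proposed by @Nicolas Barbey below.
This is the best solution, because it can be used with the axis parameter to evaluate the weighted average over multiple columns where the NaNs do not have the same indices in each column.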
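The objection to np.nan_to_num in the comments can be checked on the question's own numbers. The sketch below only compares the two results; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, np.nan, 4])
weights = np.array([4, 3, 2, 1])

# Zero-filling keeps the NaN entry's weight (2) in the denominator:
# (1*4 + 2*3 + 0*2 + 4*1) / (4 + 3 + 2 + 1) = 14 / 10
zero_filled = np.average(np.nan_to_num(a), weights=weights)

# Filtering drops both the NaN entry and its weight:
# (1*4 + 2*3 + 4*1) / (4 + 3 + 1) = 14 / 8
keep = ~np.isnan(a)
filtered = np.average(a[keep], weights=weights[keep])

print(zero_filled, filtered)
```

The numerators agree, but the zero-filled version divides by the full weight sum, which pulls the average toward zero.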