Python numpy weighted average with NaN


First things first: this is not a duplicate, and I will explain why.

Suppose I have an array

a = array([1,2,3,4])
and I want to average it with weights:

weights = [4,3,2,1]
output = average(a, weights=weights)
print output
     2.0
OK, so that is simple enough. But now I have something like this:

a = array([1,2,nan,4])
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
     nan
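Calculating the average the usual way of course gives nan. Can I avoid this? In principle I want to ignore the NaNs, so I would like to get something like this: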

a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
     1.75

First find the indices where the items are not nan, then pass the filtered versions of a and weights to numpy.average:

>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As @mtrw suggested in the comments, it is cleaner to use a boolean mask here rather than an index array:

>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
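
The mask-based filtering above can be wrapped in a small helper. This is only a sketch: the name nan_weighted_average is made up for illustration, and it assumes a and weights are one-dimensional and of equal length.

```python
import numpy as np

def nan_weighted_average(a, weights):
    """Weighted average of 1-D `a`, skipping NaN entries (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(a)  # True where a holds a real number
    # Drop the NaN entries and their weights together so the two stay aligned.
    return np.average(a[keep], weights=weights[keep])

result = nan_weighted_average([1, 2, np.nan, 4], [4, 3, 2, 1])
print(result)  # 1.75
```

If every entry of a is NaN, the filtered arrays are empty and np.average will fail, so callers may want to handle that case separately.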

Alternatively, you can use a MaskedArray:

>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
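
The masked-array approach also extends to 2D input. The sketch below uses made-up example data in which the NaNs fall in different rows of each column, and passes weights with the same shape as the data so that no assumption about weight broadcasting is needed.

```python
import numpy as np

# Two columns whose NaNs sit in different rows (illustrative data).
a = np.array([[1.0, np.nan],
              [2.0, 2.0],
              [np.nan, 4.0]])
# One weight per row, written out per column so it matches a.shape.
weights = np.array([[4.0, 4.0],
                    [3.0, 3.0],
                    [2.0, 2.0]])

ma = np.ma.MaskedArray(a, mask=np.isnan(a))
# Masked entries, together with their weights, are left out of each column's average.
col_avg = np.ma.average(ma, weights=weights, axis=0)
print(col_avg)
```

Here the first column works out to (1*4 + 2*3) / (4 + 3) and the second to (2*3 + 4*2) / (3 + 2).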
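I'll offer another solution, one that extends more easily to higher dimensions (e.g. averaging along different axes). The code below works with a 2D array that may contain NaNs, and takes the average along axis=0.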

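import numpy as np

a = np.random.randint(5, size=(3,2)).astype(float) # let's generate some random 2D array
a[0, 1] = np.nan # randint never produces nan, so put one in by hand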

# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a)) 

# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)                                                         
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec

# mean_vec is vector with weighted nan-averages of array a taken along axis=0
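
The same zero-weight idea can be written a little more compactly with broadcasting. This is a sketch under assumptions: the helper name nan_weighted_mean_axis0 is hypothetical, and one weight per row is assumed, as in the code above.

```python
import numpy as np

def nan_weighted_mean_axis0(a, w_vec):
    """Hypothetical helper: weighted mean along axis 0, NaNs get zero weight."""
    a = np.asarray(a, dtype=float)
    w_vec = np.asarray(w_vec, dtype=float).reshape(-1, 1)  # one weight per row
    valid = ~np.isnan(a)
    w_mtx = w_vec * valid  # broadcasts to a.shape; 0 wherever a is NaN
    # nansum skips the NaN products left over from NaN * 0
    weighted_sum = np.nansum(a * w_mtx, axis=0)
    return weighted_sum / w_mtx.sum(axis=0)

a = np.array([[1.0, np.nan],
              [2.0, 2.0],
              [np.nan, 4.0]])
means = nan_weighted_mean_axis0(a, [1, 2, 3])
print(means)
```

A column made up entirely of NaNs ends up as 0/0 here, which gives nan along with a runtime warning.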

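Expanding on @Ashwini's and @Nicolas's answers, here is a version that also handles the edge case where all the data values are np.nan, and that also works with a pandas DataFrame without type-related problems: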

from typing import List, Union

import numpy as np
import pandas as pd


def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in  `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
            calc_wa_ignore_nan approach:
                (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
            The ignoring nans approach:
                (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of values to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])

    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)

    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)

    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)

    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)

    return weighted_avgs
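
To see the all-NaN edge case in isolation, here is a numpy-only sketch with made-up data. It uses np.ma.average rather than np.average, which is a substitution on my part, on the assumption that a fully masked row comes back masked and can then be filled with np.nan.

```python
import numpy as np

data = np.array([[2.0, 3.0, np.nan],
                 [np.nan, np.nan, np.nan]])  # second row is entirely NaN
weights = np.array([0.5, 0.2, 0.3])

# Repeat the weights for every row so they share data's shape.
weights_2d = np.tile(weights, (data.shape[0], 1))
masked = np.ma.masked_array(data, mask=np.isnan(data))

# Average along each row, then turn any masked result back into NaN.
row_avgs = np.ma.average(masked, weights=weights_2d, axis=1).filled(np.nan)
print(row_avgs)
```

The first row is renormalised over the two remaining weights, (2*0.5 + 3*0.2) / (0.5 + 0.2), matching the docstring's note.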

All of the solutions above are good, but none of them handles the case where there are NaNs in the weights. For that, use pandas:

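import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
  den = 0  # running sum of the weights
  num = 0  # running sum of weight * value
  for index, row in df.iterrows():
    # skip the row if either the weight or the value is NaN
    if not (np.isnan(row[col_weight]) or np.isnan(row[col_value])):
      den = den + row[col_weight]
      num = num + row[col_weight]*row[col_value]
  return num/den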
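
The row-by-row loop can probably be replaced by a vectorized mask over both arrays. The sketch below is numpy-only and the function name weighted_average_skip_nan_pairs is made up. With a DataFrame one could pass df[col_value].to_numpy() and df[col_weight].to_numpy().

```python
import numpy as np

def weighted_average_skip_nan_pairs(values, weights):
    """Hypothetical helper: drop every pair where the value OR the weight is NaN."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~(np.isnan(values) | np.isnan(weights))
    # Weighted sum of the surviving pairs divided by the sum of their weights.
    return np.sum(values[valid] * weights[valid]) / np.sum(weights[valid])

result = weighted_average_skip_nan_pairs([1, 2, 3, 4], [4, np.nan, 2, 1])
print(result)  # 2.0
```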

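+1, though I think index = ~np.isnan(a) looks nicer (and may be faster for large a).
@mtrw That definitely looks better, I will update my answer. Thanks.
Another option is to use np.nan_to_num(arr) before averaging. This replaces any NaN with 0.
@TirthaR The zeros obtained this way will distort the result.
One problem here is that this method changes the size of the array, so any later operation that depends on the size has to be corrected accordingly. In that case it is better to use the masking approach proposed by @Nicolas Barbey below.
This is the best solution, because it can be used with the axis parameter to evaluate the weighted average over multiple columns where the NaNs do not have the same indices in each column.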
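
A quick check of the np.nan_to_num comment, using the array from the question. This is a sketch for illustration only. The zero-filled entry keeps its weight of 2 in the denominator, which pulls the average down.

```python
import numpy as np

a = np.array([1, 2, np.nan, 4])
weights = np.array([4, 3, 2, 1])

# NaN replaced by 0, but its weight still counts towards the total.
zero_filled = np.average(np.nan_to_num(a), weights=weights)

# NaN and its weight both dropped.
keep = ~np.isnan(a)
filtered = np.average(a[keep], weights=weights[keep])

print(zero_filled)  # 1.4
print(filtered)     # 1.75
```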