Python numpy weighted average with NaNs


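First things first: this is not a duplicate, and I will explain why.

Suppose I have an array

from numpy import array, average, nan
a = array([1,2,3,4])

and I want to average it with weights

weights = [4,3,2,1]
output = average(a, weights=weights)
print(output)
     2.0

OK, so this is pretty straightforward. But now I have something like this: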

a = array([1,2,nan,4])

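Computing the average the usual way of course yields nan. Can I avoid this? In principle I want to ignore the NaNs, so I would like to have something like this: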

a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print(output)
     1.75

First find the indices where the item is not nan, then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75

As @mtrw suggested in the comments, it is cleaner to use a boolean mask here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
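
The boolean-mask approach above can be wrapped in a small helper. This is only a minimal sketch: the function name nan_weighted_average is made up for illustration and is not part of NumPy, and it assumes NaNs appear only in the values, not in the weights.

```python
import numpy as np

def nan_weighted_average(values, weights):
    """Weighted average of `values`, skipping entries where `values` is NaN.

    Hypothetical helper for illustration; assumes `weights` contains no NaN.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(values)  # True where the value is usable
    # Drop the NaN entries together with their weights, then average as usual.
    return np.average(values[keep], weights=weights[keep])

result = nan_weighted_average([1, 2, np.nan, 4], [4, 3, 2, 1])
print(result)  # (1*4 + 2*3 + 4*1) / (4 + 3 + 1) = 14 / 8 = 1.75
```

Note that the weight belonging to the NaN entry (2 here) is dropped as well, so the remaining weights are effectively renormalised.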

Alternatively, you can use a MaskedArray:
>>> import numpy as np

>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
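I will offer another solution, one that scales better to higher dimensions (for example, averaging over different axes). The code below works with a 2D array that may contain NaNs and takes the average along axis=0.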
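import numpy as np

a = np.random.randint(5, size=(3,2)).astype(float) # random 2D array; float so it can hold NaN
a[0, 0] = np.nan # insert a NaN so there is something to ignore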

# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a)) 

# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)                                                         
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec

# mean_vec is vector with weighted nan-averages of array a taken along axis=0

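Extending the answers by @Ashwini and @Nicolas, here is a version that also handles the edge case where all data values are np.nan, and that also works with a pandas DataFrame without type-related problems: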
from typing import List, Union

import numpy as np
import pandas as pd


def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in  `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
            calc_wa_ignore_nan approach:
                (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
            The ignoring nans approach:
                (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])

    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)

    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)

    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)

    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)

    return weighted_avgs
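
The distinction drawn in the docstring's note can be checked by hand. The following is a minimal sketch using the docstring's own numbers; the variable names are illustrative and not part of the answer.

```python
import numpy as np

data = np.array([2.0, 3.0, np.nan])
weights = np.array([0.5, 0.2, 0.3])
valid = ~np.isnan(data)  # True for the two usable values

# Renormalised: divide by the sum of the weights that are actually used (0.7).
renormalised = np.sum(data[valid] * weights[valid]) / np.sum(weights[valid])

# Not renormalised: drop the NaN term but keep the original total weight of 1.0.
unnormalised = np.sum(data[valid] * weights[valid])

print(renormalised)  # about 2.2857, i.e. 1.6 / 0.7
print(unnormalised)  # about 1.6
```

The first figure is what calc_wa_ignore_nan returns for such a row; the second is what the docstring calls the "ignoring nans" approach.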

All of the solutions above are good, but none of them handles NaNs in the weights. For that, use pandas:
def weighted_average_ignoring_nan(df, col_value, col_weight):
  den = 0
  num = 0
  for index, row in df.iterrows():
    if(~np.isnan(row[col_weight]) & ~np.isnan(row[col_value])):
      den = den + row[col_weight]
      num = num + row[col_weight]*row[col_value]
  return num/den
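
The same rule (skip a row when either its value or its weight is NaN) can also be written without the Python-level loop. This is a sketch under assumptions, not a drop-in replacement: the function name is invented for illustration, and returning NaN when no usable row remains is a choice made here, whereas the loop above would raise a division error in that case.

```python
import numpy as np
import pandas as pd

def weighted_average_ignoring_nan_vectorised(df, col_value, col_weight):
    """Hypothetical vectorised variant of the iterrows-based function."""
    values = df[col_value].to_numpy(dtype=float)
    weights = df[col_weight].to_numpy(dtype=float)
    # Keep only rows where both the value and the weight are present.
    keep = ~(np.isnan(values) | np.isnan(weights))
    # Assumption: return NaN instead of dividing by zero when nothing is usable.
    if not keep.any() or weights[keep].sum() == 0:
        return np.nan
    return np.sum(values[keep] * weights[keep]) / np.sum(weights[keep])

df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0],
                   "weight": [4.0, np.nan, 2.0, 1.0]})
result = weighted_average_ignoring_nan_vectorised(df, "value", "weight")
print(result)  # rows (1, 4) and (4, 1) remain: (1*4 + 4*1) / (4 + 1) = 1.6
```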

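+1, although I think indices = ~np.isnan(a) looks better (and might be faster for large a).

@mtrw That definitely looks better, I will update my answer. Thanks.

Another option is to use np.nan_to_num(arr) before averaging. This replaces any NaN with 0.

@TirthaR The zeros obtained that way would distort the result.

One problem here is that this method changes the size of the array, so any later operation that depends on the size has to be corrected accordingly. In that case it is better to use the masking approach suggested by @Nicolas Barbey.

This is the best solution, because it can be used with the axis argument to evaluate weighted averages over several columns in which the NaNs do not sit at the same indices.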
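
Two of the points raised in these comments can be illustrated together. The sketch below uses made-up data: it shows how filling NaNs with zeros shifts the result, and how a masked array combined with the axis argument handles columns whose NaNs sit in different rows.

```python
import numpy as np

# Illustrative data: each column has its NaN in a different row.
a = np.array([[1.0, np.nan],
              [2.0, 2.0],
              [np.nan, 3.0],
              [4.0, 4.0]])
weights = np.array([4.0, 3.0, 2.0, 1.0])

# Zero-filling keeps the weight of the missing entry in the denominator,
# so the first column gives (4 + 6 + 0 + 4) / 10 = 1.4 instead of 1.75.
zero_filled = np.average(np.nan_to_num(a[:, 0]), weights=weights)

# Masking drops each NaN together with its weight, column by column.
masked = np.ma.MaskedArray(a, mask=np.isnan(a))
per_column = np.ma.average(masked, weights=weights, axis=0)

print(zero_filled)  # 1.4
print(per_column)   # column 0: 14 / 8 = 1.75, column 1: 16 / 6, about 2.667
```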