Python: weighted median algorithm with pandas


I have a dataframe that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476
I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudocode looks like this:

# Sort `impwealth` in ascending order 
df.sort('impwealth', 'inplace'=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `impweight` that is greater than P 
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.ix[i, 'impwealth']
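This approach seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the reference documentation. What is the best way to find the weighted median?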

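Have you tried this package? I've never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double-check that it uses the approach you expect).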


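You can also use this function that I wrote for the same purpose.

Note: weighted uses interpolation at the end to select the 0.5 quantile (you can look at the code yourself).

The function I wrote just returns one bounding value for the 0.5 weight.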

import numpy as np

def weighted_median(values, weights):
    '''Compute the weighted median of a list of values.

    The weighted median is computed as follows:
    1- sort both lists (values and weights) based on values.
    2- select the 0.5 point from the weights and return the corresponding value as the result.
    e.g. values = [1, 3, 0] and weights = [0.1, 0.3, 0.6], assuming weights are probabilities.
    sorted values = [0, 1, 3] and corresponding sorted weights = [0.6, 0.1, 0.3]; the 0.5 point on
    the weights corresponds to the first item, which is 0, so the weighted median is 0.'''

    #convert the weights into probabilities
    sum_weights = sum(weights)
    weights = np.array([(w*1.0)/sum_weights for w in weights])
    #sort values and weights based on values
    values = np.array(values)
    sorted_indices = np.argsort(values)
    values_sorted  = values[sorted_indices]
    weights_sorted = weights[sorted_indices]
    #select the median point
    it = np.nditer(weights_sorted, flags=['f_index'])
    accumulative_probability = 0
    median_index = -1
    while not it.finished:
        accumulative_probability += it[0]
        if accumulative_probability > 0.5:
            median_index = it.index
            return values_sorted[median_index]
        elif accumulative_probability == 0.5:
            median_index = it.index
            it.iternext()
            next_median_index = it.index
            return np.mean(values_sorted[[median_index, next_median_index]])
        it.iternext()

    return values_sorted[median_index]

# compare the weighted_median function with np.median
print(weighted_median([1, 3, 0, 7], [2, 3, 3, 9]))
print(np.median([1, 1, 0, 0, 0, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7]))

If you want to do this in pure pandas, here's a way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode.)
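A minimal sketch of that cumulative-sum approach, assuming the ten rows shown in the question are the whole dataframe and that it is named df (the DataFrame construction below is only there to make the example self-contained):

```python
import pandas as pd

# Rows copied from the question; building the frame by hand is an assumption
# made so the example runs on its own.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

# Sort by the value column, then find where the running total of the
# weights first reaches half of the total weight.
df_sorted = df.sort_values("impwealth")
cutoff = df_sorted["indweight"].sum() / 2.0

# Comparing the raw weights with the cutoff never matches any row,
# which is why the cumulative sum is needed.
no_raw_weight_exceeds = not (df_sorted["indweight"] > cutoff).any()

cumsum = df_sorted["indweight"].cumsum()
median = df_sorted.loc[cumsum >= cutoff, "impwealth"].iloc[0]
print(median)
```

No interpolation is done here: the result is always one of the observed values.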


This gives a median of 925000.

This function generalizes proofreader's solution:

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

In this example, it would be weighted_median(df, 'impwealth', 'indweight').

You can use the weighted_quantile function, listed in full after the comments below, to compute weighted percentiles; the weighted median is the 0.5 quantile.

You can also compute the weighted median with the robustats library, as in the final code listing on this page.

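There is a package, weightedstats, available through conda or pip, that provides weighted_median.

Assuming you are using conda, from the terminal (Mac/Linux) or the Anaconda prompt (Windows):

conda activate YOUR_ENVIRONMENT
conda install -c conda-forge -y weightedstats

(-y means "don't ask me to confirm changes, just do it.")

Then in your Python code:

import pandas as pd
import weightedstats as ws
df = pd.read_csv('/your/data/file.csv')
ws.weighted_median(df['values_col'], df['weights_col'])

I'm not sure it works in all cases, but I compared some simple data against the function weightedMedian() from the R package matrixStats, and the results were the same.

P.S. By the way, with weightedstats you can also compute weighted_mean(), although NumPy can do that too:

np.average(df['values_col'], weights=df['weights_col'])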
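As a quick illustration of the NumPy call, here is a small sketch that reuses the ten rows from the question (the DataFrame construction and the column choice are assumptions made for a self-contained example). It only checks that np.average with weights agrees with the explicit formula sum(w * x) / sum(w):

```python
import numpy as np
import pandas as pd

# Rows copied from the question; used here purely as sample data.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    }
)

# Weighted mean via NumPy.
weighted_mean = np.average(df["impwealth"], weights=df["indweight"])

# The same quantity written out by hand.
manual_mean = (df["impwealth"] * df["indweight"]).sum() / df["indweight"].sum()

print(weighted_mean, manual_mean)
```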

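Are you sure your pseudocode is correct? df['indweight'].sum() * (.5) will give a value of about 219, which none of your indweight values exceed. Calling df['indweight'].median() gives 44.835, and mean() gives 43.783.

I think so. df['indweight'].sum() * (.5) should compute the number of observations that fall below the 50th percentile of the data, since indweight is a frequency weight. So it makes sense that half the sum of indweight exceeds its mean and median.

@svenkatesh, you need to use the .cumsum() of indweight, not indweight itself. See my answer below.

If you look at the code, the weighted median function is very similar to the accepted answer, but it does not interpolate at the end.

Personally, I'm a bit wary of installing a package when a few lines of code will do, but if you need interpolated weighted medians, maybe this is the best way.

Would you mind explaining what the following line does: weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight

Suppose we have values [3, 10, 12] and associated weights [0.2, 0.5, 0.3] (sorted by the previous lines). np.cumsum would yield [0.2, 0.7, 1.0], but these are really the right edges of the associated quantiles. To center them, we subtract half of the weight from each bucket, which gives [0.1, 0.45, 0.85]. These are the points we interpolate over to get the weighted quantiles.

Thanks a lot. One more question (sorry if it's a dumb one): why center the quantiles?

Suppose there are only two values, (3, 4); should their respective quantiles be (0, 0.5), (0.5, 1), (0.25, 0.75), or (0, 1)? The first two are problematic because they are asymmetric. The third is what this function does by default, and the fourth is what numpy.percentile does, which can be activated here with the old_style=True argument. One advantage of the default is that if you sample from the quantiles, you have a nonzero chance of getting an observed value, e.g. 3 for quantiles 0 to 0.25. However, the resulting trapezoidal distribution may be less intuitive than the flat distribution of old_style.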
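The centering discussed in these comments can be checked numerically. The sketch below uses only NumPy and the numbers quoted in the comments; the variable names are made up for illustration and are not part of the weighted_quantile function itself:

```python
import numpy as np

# Example from the comments: sorted values with their weights.
values = np.array([3.0, 10.0, 12.0])
weights = np.array([0.2, 0.5, 0.3])

# Cumulative sums give the right edge of each value's quantile bucket.
right_edges = np.cumsum(weights)

# Subtracting half of each weight moves every point to its bucket center.
centers = right_edges - 0.5 * weights

# The weighted median is read off by linear interpolation at quantile 0.5.
median = np.interp(0.5, centers, values)
print(centers, median)

# Two equally weighted values, as in the (3, 4) example.
w2 = np.array([1.0, 1.0])

# Default behaviour: centered positions, normalised by the total weight.
centered2 = (np.cumsum(w2) - 0.5 * w2) / w2.sum()

# old_style behaviour: shift and rescale so the positions span [0, 1].
old2 = np.cumsum(w2) - 0.5 * w2
old2 -= old2[0]
old2 /= old2[-1]
print(centered2, old2)
```

With the centered positions, quantiles below 0.25 or above 0.75 in the two-value case map to the observed values themselves, because np.interp clamps outside the given range.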
import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

import numpy as np
import robustats # pip install robustats


# Weighted Median
x = np.array([1.1, 5.3, 3.7, 2.1, 7.0, 9.9])
weights = np.array([1.1, 0.4, 2.1, 3.5, 1.2, 0.8])

weighted_median = robustats.weighted_median(x, weights)

print("The weighted median is {}".format(weighted_median))