Python: What is the formula for Welford's algorithm for variance/std with batch updates?


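I want to extend Welford's online algorithm so that it can be updated with several numbers at once (a batch) instead of just one number at a time.

I tried to adapt the algorithm from the wiki page as follows: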

import numpy as np

# my attempt.
def update1(existingAggregate, newValues):
    (count, mean, M2) = existingAggregate
    count += len(newValues) 
    delta = np.sum(np.subtract(newValues, [mean] * len(newValues)))
    mean += delta / count
    delta2 = np.sum(np.subtract(newValues, [mean] * len(newValues)))
    M2 += delta * delta2

    return (count, mean, M2)

# The original two functions from Wikipedia.
def update(existingAggregate, newValue):
    (count, mean, M2) = existingAggregate
    count += 1
    delta = newValue - mean
    mean += delta / count
    delta2 = newValue - mean
    M2 += delta * delta2

    return (count, mean, M2)

def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    # Check the count first so that count == 1 does not divide by zero.
    if count < 2:
        return float('nan')
    else:
        (mean, variance, sampleVariance) = (mean, M2/count, M2/(count - 1))
        return (mean, variance, sampleVariance)

Note that a = (2, 2.0, 2.0) means that we have 2 observations whose mean is 2.0 (and whose M2 is 2.0).

# update one at a time.
temp = update(a, newValues[0])
result_single = update(temp, newValues[1])
print(finalize(result_single))

# update with my faulty batch function.
result_batch = update1(a, newValues)
print(finalize(result_batch))

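The correct output should be the one obtained by applying the single-number update twice (the first line below); the second line is what my batch function prints: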

(3.0, 2.0, 2.6666666666666665)
(3.0, 2.5, 3.3333333333333335)
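The post does not show how `a` and `newValues` were defined. As a sanity check on the first tuple, here is a small standard-library sketch. The concrete data are an assumption, not part of the post: two earlier observations `[1.0, 3.0]` reproduce `a = (2, 2.0, 2.0)`, and a batch of `[5, 3]` is consistent with both printed tuples.

```python
import statistics

# Assumed data (not given in the post): two earlier observations and a new batch.
old_values = [1.0, 3.0]
new_values = [5, 3]

# Aggregate in the (count, mean, M2) layout used above.
old_mean = statistics.mean(old_values)
a = (len(old_values), old_mean, sum((v - old_mean) ** 2 for v in old_values))
print(a)  # (2, 2.0, 2.0)

# Reference statistics computed directly over all four observations.
all_values = old_values + new_values
reference = (
    statistics.mean(all_values),
    statistics.pvariance(all_values),  # population variance, M2 / count
    statistics.variance(all_values),   # sample variance, M2 / (count - 1)
)
print(reference)
```

Under these assumed values the reference tuple should agree with the single-update result, which is the target any batch function has to hit.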

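What am I missing about the correct variance update? Do I also need to change the finalize function in some way?

The reason I need this is that I am working with very large monthly files (with varying numbers of observations), and I need to get yearly means and variances.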

I am not very familiar with Python, so I would rather stick to mathematical notation.

To update the mean, you have to do the following:

s = sum of new values
c = number of new values
newMean = oldMean + sum_i (newValue[i] - oldMean) / newCount

For M2, you have to add another sum:

newM2 = oldM2 + sum_i ((newValue[i] - newMean) * (newValue[i] - oldMean))
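Why a sum of products is needed can be sketched with a short derivation. The symbols are introduced here for illustration only: $n$ old observations with mean $\mu$ and sum of squared deviations $M_2$, $c$ new values $x_i$, $N = n + c$, and the new mean $\mu'$.

```latex
\begin{align*}
N\mu' &= n\mu + \sum_i x_i
  \;\Longrightarrow\; \sum_i (x_i - \mu') = n(\mu' - \mu), \\
M_2' &= M_2 + n(\mu' - \mu)^2 + \sum_i (x_i - \mu')^2, \\
\sum_i (x_i - \mu')(x_i - \mu)
  &= \sum_i (x_i - \mu')^2 + (\mu' - \mu)\sum_i (x_i - \mu') \\
  &= \sum_i (x_i - \mu')^2 + n(\mu' - \mu)^2 .
\end{align*}
```

So $M_2' - M_2$ equals the sum of the per-value products, which is the formula above. Multiplying the two summed deltas gives a different quantity in general.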

I am not sure whether you really save anything with the batch update, because there is still a loop inside.
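On the question of whether batching saves anything: a different technique from the one in this answer is the pairwise combination formula attributed to Chan et al., which merges two (count, mean, M2) aggregates without revisiting the raw values. The sketch below is only an illustration for the monthly-to-yearly case; the function names `summarize` and `merge` and the tiny "monthly" lists are hypothetical.

```python
import statistics

def summarize(values):
    """Return (count, mean, M2) for one batch, computed in two passes."""
    count = len(values)
    mean = sum(values) / count
    m2 = sum((v - mean) ** 2 for v in values)
    return (count, mean, m2)

def merge(agg_a, agg_b):
    """Combine two (count, mean, M2) aggregates with the pairwise formula."""
    count_a, mean_a, m2_a = agg_a
    count_b, mean_b, m2_b = agg_b
    count = count_a + count_b
    if count == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta * delta * count_a * count_b / count
    return (count, mean, m2)

# Hypothetical monthly batches with different numbers of observations.
months = [[1.0, 3.0], [5.0, 3.0, 4.0], [2.0]]

yearly = (0, 0.0, 0.0)  # empty aggregate
for month in months:
    yearly = merge(yearly, summarize(month))

count, mean, m2 = yearly
all_values = [v for month in months for v in month]
print(count, mean, m2 / count, m2 / (count - 1))
```

With this layout each monthly file only needs to be summarized once, and the yearly figures come from merging twelve small tuples.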

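Thanks to Nico's clarification, I figured it out! The problem was that I summed the deltas and then multiplied the two sums to get M2, but the products of the deltas have to be summed instead. Here is the correct batch function, which accepts single numbers as well as batches: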

# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
import numpy as np

def update(existingAggregate, newValues):
    if isinstance(newValues, (int, float, complex)):
        # Handle a single number by wrapping it in a list.
        newValues = [newValues]

    (count, mean, M2) = existingAggregate
    count += len(newValues)
    # newValues - oldMean (the scalar mean is broadcast over the batch)
    delta = np.subtract(newValues, mean)
    mean += np.sum(delta / count)
    # newValues - newMean
    delta2 = np.subtract(newValues, mean)
    # Sum the per-value products, not the product of the sums.
    M2 += np.sum(delta * delta2)

    return (count, mean, M2)

def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    # Check the count first so that count == 1 does not divide by zero.
    if count < 2:
        return float('nan')
    else:
        (mean, variance, sampleVariance) = (mean, M2/count, M2/(count - 1))
        return (mean, variance, sampleVariance)
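As a rough consistency check, the sketch below restates the single-value and batch updates in plain Python and compares them with the `statistics` module on the pooled data. The names `update_single` and `update_batch` and the synthetic batches are illustrative assumptions, not part of the post.

```python
import random
import statistics

def update_single(aggregate, new_value):
    """Welford update for one value."""
    count, mean, m2 = aggregate
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return (count, mean, m2)

def update_batch(aggregate, new_values):
    """Batch update: sum the per-value products of the two deltas."""
    count, mean, m2 = aggregate
    count += len(new_values)
    deltas = [v - mean for v in new_values]  # deviations from the old mean
    mean += sum(deltas) / count
    # Pair each old-mean deviation with the matching new-mean deviation.
    m2 += sum(d * (v - mean) for d, v in zip(deltas, new_values))
    return (count, mean, m2)

# Synthetic batches of different sizes (seeded so the run is repeatable).
rng = random.Random(0)
batches = [[rng.uniform(0.0, 100.0) for _ in range(size)] for size in (5, 12, 3)]

batched = (0, 0.0, 0.0)
one_by_one = (0, 0.0, 0.0)
for batch in batches:
    batched = update_batch(batched, batch)
    for value in batch:
        one_by_one = update_single(one_by_one, value)

all_values = [value for batch in batches for value in batch]
count, mean, m2 = batched
print(mean, statistics.mean(all_values))
print(m2 / count, statistics.pvariance(all_values))
print(m2 / (count - 1), statistics.variance(all_values))
```

Each printed pair should agree up to floating-point rounding, and `one_by_one` should match `batched` to the same precision.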

And it is indeed faster:

import random
import timeit

x = random.sample(range(1, 10000), 1000)
# ...
b = random.sample(range(1, 10000), 1000)

# Time a single batch update over all of b.
start_time = timeit.default_timer()
result_batch = update(a, b)
print(f'{timeit.default_timer() - start_time:.4f}')
print(*(f'{x:.2f}' for x in finalize(result_batch)))

# Time the same values fed in one at a time
# (the batch-capable update also accepts a single number).
start_time = timeit.default_timer()
for i in b:
    a = update(a, i)
print(f'{timeit.default_timer() - start_time:.4f}')
print(*(f'{x:.2f}' for x in finalize(a)))

Results:

0.0010
5008.36 8423224.68 8427438.40
0.0031
5008.36 8423224.68 8427438.40

Thanks, this was very helpful! One numerical improvement (to avoid catastrophic cancellation) is to subtract before summing when computing the new mean, i.e. newMean = oldMean + sum_i (newValue[i] - oldMean) / newCount, instead of first forming cOldMean = c * oldMean and subtracting it from the sum (in particular, we do not use …).
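The cancellation concern can be illustrated with a small sketch. Everything in it is an assumption chosen for the demonstration: a large offset of 1e9, a small spread in the new values, and exact rational arithmetic from `fractions` as the reference for the shift of the mean.

```python
from fractions import Fraction

old_count = 1000
old_mean = 1e9                                       # large, exactly representable offset
new_values = [1e9 + 0.001 * i for i in range(1000)]  # small spread around the offset
new_count = old_count + len(new_values)

# Variant 1: sum first, then subtract c * oldMean (two huge, nearly equal numbers).
s = sum(new_values)
c_old_mean = len(new_values) * old_mean
shift_sum_first = (s - c_old_mean) / new_count

# Variant 2: subtract oldMean from each value first, then sum the small differences.
shift_subtract_first = sum(v - old_mean for v in new_values) / new_count

# Exact reference for the mean shift, using rationals built from the stored floats.
exact_shift = sum(Fraction(v) - Fraction(old_mean) for v in new_values) / new_count

error_sum_first = abs(Fraction(shift_sum_first) - exact_shift)
error_subtract_first = abs(Fraction(shift_subtract_first) - exact_shift)
print(float(error_sum_first), float(error_subtract_first))
```

On this data the subtract-first variant should be at least as close to the exact shift as the sum-first variant, because the individual differences are small and the huge intermediate sum never appears.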
