Python numpy histogram cumulative density does not sum to 1

Following a tip from another thread, I wrote:

# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')

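I took density=1 from the documentation for np.histogram, which says:

"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."

Indeed, when plotted, the values do not sum to 1. However, I don't understand "bins of unity width". When I set bins to 1, I of course get an empty plot; when I set it to the population size, the sum is still not 1 (more like 0.2). When I use the suggested 40 bins, the values sum to about .006.

Can anyone give me some guidance? Thanks.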

You need to make sure that all of your bins have a width of 1. That is:

np.all(np.diff(base)==1)

To do that, you have to specify the bins manually:

# +1 so that the last edge, ceil(max), is included and no data falls outside the bins
bins = np.arange(np.floor(nearest.min()), np.ceil(nearest.max()) + 1)
values, base = np.histogram(nearest, bins=bins, density=1)

You will get:

In [18]: np.all(np.diff(base)==1)
Out[18]: True

In [19]: np.sum(values)
Out[19]: 0.99999999999999989
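
One way to see what the documentation note means without forcing unit-width bins is to compare the raw sum of the density values with the sum of density times bin width. The following is a minimal sketch, assuming a synthetic array in place of the question's nearest data (the random data and the seed are illustrative, not from the original post):

```python
import numpy as np

# Hypothetical stand-in for the question's `nearest` array.
rng = np.random.default_rng(0)
nearest = rng.uniform(0.0, 50.0, size=500)

# 20 equal-width bins, as in the question.
values, base = np.histogram(nearest, bins=20, density=True)
widths = np.diff(base)

raw_sum = values.sum()                   # sum of densities: equals 1 / bin width here
area = (values * widths).sum()           # sum of density * width: total probability
cumulative = np.cumsum(values * widths)  # a CDF that ends at 1

print(raw_sum, 1.0 / widths[0])
print(area)
print(cumulative[-1])
```

With equal-width bins the raw sum is 1 divided by the bin width, which is why it only equals 1 when that width is 1. Weighting each density by its bin width gives a cumulative curve that tops out at 1 for any choice of bins.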

You can simply normalize the values variable yourself, like so:

unity_values = values / values.sum()
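
Dividing by the sum gives the per-bin probabilities only when all bins have the same width. The sketch below, which uses made-up normal data and hand-picked bin edges purely for illustration, compares that shortcut with multiplying each density by its own bin width:

```python
import numpy as np

# Illustrative data; not from the original post.
rng = np.random.default_rng(1)
data = rng.normal(size=200)

# Equal-width bins: both normalizations agree.
values, base = np.histogram(data, bins=10, density=True)
unity_values = values / values.sum()
probabilities = values * np.diff(base)
same_for_equal = np.allclose(unity_values, probabilities)

# Unequal-width bins (hypothetical edges): only density * width gives probabilities.
edges = np.array([-5.0, -1.0, 0.0, 0.5, 1.0, 5.0])
values2, base2 = np.histogram(data, bins=edges, density=True)
unity_values2 = values2 / values2.sum()
probabilities2 = values2 * np.diff(base2)
same_for_unequal = np.allclose(unity_values2, probabilities2)

print(same_for_equal, same_for_unequal)
print(probabilities2.sum())
```

Both versions sum to 1, but with unequal bins the values divided by their sum no longer represent the fraction of samples in each bin.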

A complete example looks like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=37)
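# The original answer also passed normed=True; that argument is deprecated
# (and removed in recent NumPy releases), so only density=True is kept.
density, bins = np.histogram(x, density=True)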
unity_density = density / density.sum()

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
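widths = np.diff(bins)  # positive bin widths
# align='edge' anchors each bar at the left edge of its bin
ax1.bar(bins[:-1], density, width=widths, align='edge')
ax2.bar(bins[:-1], density.cumsum(), width=widths, align='edge')

ax3.bar(bins[:-1], unity_density, width=widths, align='edge')
ax4.bar(bins[:-1], unity_density.cumsum(), width=widths, align='edge')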

ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()

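In fact, the statement

"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."

means that the output we get is the value of the probability density function (pdf) for each bin. In a pdf, the probability of falling between two values, say 'a' and 'b', is the area under the pdf curve between 'a' and 'b'. So to obtain the probability for each bin, we have to multiply that bin's pdf value by the bin width. The resulting sequence of probabilities can then be used directly to compute the cumulative probability, since it is now normalized.
Note that the newly computed probabilities sum to 1, which satisfies the requirement that the total probability is 1; in other words, the probabilities are normalized.
See the code below, where I use bins of different widths: some have width 1 and some have width 2.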

import numpy as np
import math

rng = np.random.RandomState(10)  # deterministic random data
a = np.hstack((rng.normal(size=1000), rng.normal(loc=5, scale=2, size=1000)))  # 'a' is our distribution of data
mini = math.floor(min(a))
maxi = math.ceil(max(a))
print(mini)
print(maxi)
ar1 = np.arange(mini, maxi/2)
ar2 = np.arange(math.ceil(maxi/2), maxi+2, 2)
ar = np.hstack((ar1, ar2))
print(ar)  # ar is the array of unequal widths, which is used below to generate the bin_edges
counts, bin_edges = np.histogram(a, bins=ar, density=True)
print(counts)  # the pdf values of respective bin_edges
print(bin_edges)  # the corresponding bin_edges
print(np.sum(counts*np.diff(bin_edges)))  # finding total sum of probabilities, equal to 1
print(np.cumsum(counts*np.diff(bin_edges)))  # to get the cumulative sum, see the last value, it is 1.

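Now, I think the reason the documentation says the bin width should be 1 is that, when the bin width equals 1, the pdf value and the probability of any bin are equal: computing the area under the bin just multiplies the bin's pdf value by 1, which again gives the pdf value.
So in this case the pdf values equal the probabilities of the respective bins and are therefore already normalized.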
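
The unit-width case can be checked directly. This is a small sketch, assuming synthetic data and integer bin edges built with np.arange (both are illustrative choices), that compares the density output with the plain fraction of samples per bin:

```python
import numpy as np

# Illustrative data; not from the original post.
rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Integer edges from floor(min) to ceil(max), so every bin has width 1.
edges = np.arange(np.floor(data.min()), np.ceil(data.max()) + 1)

density, _ = np.histogram(data, bins=edges, density=True)
counts, _ = np.histogram(data, bins=edges)
probabilities = counts / counts.sum()  # fraction of samples in each bin

print(np.all(np.diff(edges) == 1))
print(np.allclose(density, probabilities))
print(density.sum())
```

With unit-width bins the density values and the per-bin probabilities coincide, so the plain sum of the density values is 1.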
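
Does the sum of the areas equal 1? I'm guessing so.

Paul, sorry, my statistics are weak. I'm working from an R example where the y-axis values go from 0 to 1 and the CDF tops out at 1. (I'd post a screenshot if I knew how.)

For me, when I had bins from np.arange(0, 1005, 10), I just needed to multiply them all by 10. I haven't checked, but it seems you just need to multiply the densities by the diff factor, which in my case was 10!

Thanks, now the curve looks more like what I was aiming for.

From the docs: "If 'bins' is an int, it defines the number of equal-width bins in the given range (10, by default)". So the OP's example should work by default, shouldn't it? Looks like a bug.

The bins are equal in width to each other, but not necessarily of width 1.

Ah, I see, the factor is the bin width, so for equal-width bins you can get unity by multiplying by base[1]-base[0].

Thank you, Paul. A few days ago I did try dividing (normalizing) my "nearest" vector. I can't remember why I wasn't satisfied with the result. I probably did it wrong.

In the line density, bins = np.histogram(x, normed=True, density=True), why are both normed and density set to True? The numpy documentation says normed is deprecated; I ask because I'm trying to get a normalized cumulative histogram with numpy histogram.

@mikey This answer was written in 2014, before numpy deprecated normed. What do you mean by normalized? Some people mean that in a normalized histogram the tallest bar should have a value of 1. Others want the areas of the bars to sum to 1. I'm pretty sure density gets you the latter, and you have to compute the former yourself.

@mikey (The numpy documentation explains it better than I do, by the way.)

@PaulH By normalized, I mean that the area under the curve sums to 1. I asked a question.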
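
The two meanings of "normalized" mentioned in the comments can be sketched side by side. The data and the bin count below are illustrative assumptions; the point is only that density=True corresponds to the area-based version, while peak scaling has to be computed by hand:

```python
import numpy as np

# Illustrative data; not from the original post.
rng = np.random.default_rng(3)
data = rng.normal(size=500)

counts, edges = np.histogram(data, bins=15)
widths = np.diff(edges)

# Area normalization: bar areas sum to 1 (what density=True returns).
area_normalized = counts / (counts.sum() * widths)
density, _ = np.histogram(data, bins=15, density=True)

# Peak normalization: the tallest bar has height 1 (computed manually).
peak_normalized = counts / counts.max()

print(np.allclose(area_normalized, density))
print((area_normalized * widths).sum())
print(peak_normalized.max())
```

For a cumulative histogram that ends at 1, the area-normalized values still need to be multiplied by the bin widths before taking the cumulative sum.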