How to plot a normal distribution curve using the central limit theorem in Python

python numpy matplotlib statistics


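I am trying to get a normal distribution curve drawn over the distribution of my central limit theorem data.

Below is the implementation I have tried: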

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 1000 simulations of die roll
n = 10000

avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time

plt.hist(avg[0:])

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))

I get the graph below.

You can see the red normal curve at the bottom.

Can someone tell me why the curve doesn't fit?

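The logic looks correct.

The problem is in how the data is displayed.

Try passing normed=True to normalize the first histogram as well, and use the same number of bins for both histograms, say 20. (In Matplotlib 3.1 and later the argument is called density=True; normed was removed.)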

import numpy as np
import matplotlib.pyplot as plt

# 10000 simulations of 10 die rolls each
n = 10000

avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)  # 10 rolls, each from 1 to 6
    avg.append(np.average(a))        # mean of those 10 rolls

# Normalized histogram of the sample means, 20 bins
# (density=True replaces normed=True, which was removed in Matplotlib 3.1)
plt.hist(avg, 20, density=True)

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram of the normal samples, same number of bins
count, bins, ignored = plt.hist(s, 20, density=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)))

I have just scaled the histogram of the avg list down to a density.

Plot:
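As a rough illustration of why the normalization matters, here is a small sketch that is not part of the original answer. The seed and the variable names counts, density and widths are my own choices. It compares the bar heights of a raw-count histogram with those of a density histogram built from the same sample means.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

# 10000 sample means, each over 10 die rolls (same idea as the loop above)
avg = [np.random.randint(1, 7, 10).mean() for _ in range(10000)]

counts, edges = np.histogram(avg, bins=20)              # raw counts per bin
density, _ = np.histogram(avg, bins=20, density=True)   # heights scaled to unit area
widths = np.diff(edges)

print(counts.max())               # tallest bar in counts
print(density.max())              # tallest bar after normalization
print(np.sum(density * widths))   # total area of the density histogram
```

With equal-width bins the two sets of heights differ by a factor of the sample size times the bin width, which is why a pdf drawn over an unnormalized histogram looks flattened against the x-axis.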

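You almost had it! First, note that you are plotting two histograms on the same axes:

plt.hist(avg[0:])

and

plt.hist(s, 20, normed=True)

So that you could plot the normal density over the histogram, you correctly normalised the second plot with the normed=True argument. However, you forgot to normalise the first histogram as well: plt.hist(avg[0:], normed=True).

I would also suggest that, since you have already imported scipy.stats, you may as well use the normal distribution from that module rather than coding the pdf yourself.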
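To make that suggestion concrete, here is a minimal check that the hand-written density from the question and stats.norm.pdf agree. The values of mu and sigma below are illustrative stand-ins, not numbers taken from the question's output.

```python
import numpy as np
import scipy.stats as stats

mu, sigma = 3.5, 0.54   # illustrative values, roughly what the simulation yields
x = np.linspace(1.5, 5.5, num=100)

# Hand-written normal density, as in the question
manual = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Same density from scipy.stats (loc is the mean, scale is the standard deviation)
from_scipy = stats.norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(manual, from_scipy))
```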

Putting this all together, we have:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# 10000 simulations of 10 die rolls each
n = 10000

avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)  # 10 rolls, each from 1 to 6
    avg.append(np.average(a))        # mean of those 10 rolls

# CHANGED: normalise this histogram too
# (density=True replaces normed=True, which was removed in Matplotlib 3.1)
plt.hist(avg, 20, density=True)

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Use scipy.stats implementation of the normal pdf
# Plot the distribution curve
x = np.linspace(1.5, 5.5, num=100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))

This gives me the following plot:
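As an aside, the Python loop that builds avg can be replaced by a single vectorized call. This is only a sketch: it assumes NumPy's Generator API (np.random.default_rng), which the original code does not use, and the seed and the names rolls and dice are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed is an arbitrary choice
n, rolls = 10000, 10

# One row per simulation, one column per roll; integers(1, 7) draws from 1..6
dice = rng.integers(1, 7, size=(n, rolls))
avg = dice.mean(axis=1)           # sample mean of each row

mu, sigma = avg.mean(), avg.std()
print(mu, sigma)
```

The resulting avg array can be passed to plt.hist in the same way as the list built by the loop.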

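Edit: in the comments you asked:

1. How did I choose 1.5 and 5.5 in np.linspace?

2. Is it possible to plot the normal kernel over the non-normalised histogram?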

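To address question 1: I chose 1.5 and 5.5 by eye. After plotting the histogram I saw that its bins ranged from about 1.5 to 5.5, so that is the range over which we want to plot the normal distribution.

A more programmatic way of choosing this range is: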

x = np.linspace(bins.min(), bins.max(), num=100)

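As for question 2: yes, we can achieve what you want. However, you should be aware that we will no longer be plotting a probability density function.

After removing the normed=True argument when plotting the histograms: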

x = np.linspace(bins.min(), bins.max(), num=100)

# Find pdf of normal kernel at mu
max_density = stats.norm.pdf(mu, mu, sigma)
# Calculate how to scale pdf
scale = count.max() / max_density

plt.plot(x, scale * stats.norm.pdf(x, mu, sigma))

This gives me the following plot:
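The scaling above matches the peak of the curve to the tallest bar. An alternative, offered here as a sketch and not as the answer's method, is to scale the pdf by the sample size times the bin width, since the expected count in a bin is roughly N * bin_width * pdf at the bin centre. The normal samples below are a stand-in generated with illustrative parameters, and scaled_pdf is a name introduced for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)          # arbitrary seed
s = rng.normal(3.5, 0.54, size=10000)   # stand-in for the normal samples above

count, bins = np.histogram(s, bins=20)  # raw counts, like plt.hist without normalization
bin_width = bins[1] - bins[0]

mu, sigma = s.mean(), s.std()
x = np.linspace(bins.min(), bins.max(), num=100)

# Expected count per bin is about N * bin_width * pdf
scaled_pdf = len(s) * bin_width * stats.norm.pdf(x, mu, sigma)

# plt.plot(x, scaled_pdf) would draw this curve over plt.hist(s, 20)
print(count.max(), scaled_pdf.max())
```

Unlike peak matching, this scale does not depend on the noise in a single bin, so the curve and the bars should agree across the whole range and not only at the maximum.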

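Rolling a die is a case of a discrete uniform distribution: each number from 1 to 6 comes up with probability 1/6. A single roll therefore has mean 3.5 and standard deviation sqrt(35/12), approximately 1.7078.

Now, the CLT says that for a sufficiently large n (n is 10 in the code), the pdf of the mean of n throws approaches a normal distribution with mean 3.5 and standard deviation 1.7078/sqrt(10).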
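These values can be checked numerically. The short sketch below recomputes the mean and standard deviation of a single fair die and the standard deviation of a 10-roll mean; the variable names are illustrative.

```python
import numpy as np

faces = np.arange(1, 7)            # 1..6, each with probability 1/6
die_mean = faces.mean()            # (1 + ... + 6) / 6
die_std = faces.std()              # population std (ddof=0), i.e. sqrt(35/12)

n_rolls = 10
std_of_mean = die_std / np.sqrt(n_rolls)   # std of the mean of 10 independent rolls

print(die_mean, die_std, std_of_mean)
```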

import math
import numpy as np
import scipy.stats as stats

# avg is the list of 10-roll sample means built in the code above
n_bins = 50
pdf_from_hist, bin_edges = np.histogram(np.array(avg), bins=n_bins, density=True)
bin_mid_pts = np.add(bin_edges[:-1], bin_edges[1:]) * 0.5
assert len(pdf_from_hist) == len(bin_mid_pts)
expected_std = 1.7078 / math.sqrt(10)
expected_mean = 3.5
pk_s = []
qk_s = []
for i in range(n_bins):
    p = stats.norm.pdf(bin_mid_pts[i], expected_mean, expected_std)
    q = pdf_from_hist[i]
    if q <= 1.0e-5:
        continue  # skip (near-)empty bins, where the KL term would blow up
    pk_s.append(p)
    qk_s.append(q)
# compute the KL divergence (stats.entropy normalizes both inputs to sum to 1)
kl_div = stats.entropy(pk_s, qk_s)
print('the pdf of the mean of the 10 throws differs from the corresponding normal dist with a kl divergence of %r' % kl_div)

You may need to scale one of them. The maximum of the normal distribution is a bit over 1, IIRC, whereas the plot goes up to 2500. Either scale the normal distribution to the maximum (about 2700), or use ax.twinx().

Can you show that in code? Please explain in a bit more detail, as I can't understand why 2 histograms are plotted even if I comment out the last line, which plots a histogram.

@penta As the comment says, it plots the curve, not a histogram: the green curve in the plot. Commenting out the last line will only remove the green curve.

Hi @Jack, thanks for your help. Could you explain on what basis you used the values 1.5 and 5.5 in np.linspace(1.5, 5.5, num=100)? Also, can we try plotting the normal curve without normalizing the histogram, so that the curve is actually plotted against the histogram's raw values?