Gaussian kernel density estimation (KDE) of large numbers in Python


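I have 1000 large numbers, randomly distributed between 37231 and 56661.

I am trying to use stats.gaussian_kde, but something isn't working. (Maybe because my knowledge of statistics is poor?)

The code is as follows: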

from scipy.stats import gaussian_kde
from numpy import linspace
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)

# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000)   # get 1000 points on x axis
nPoints = len(x)

# get actual kernel density.
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
    print("%s   %s" % (x[i], y[i]))

plt.plot(x, density(x))
plt.show()

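In the printed output I get the x values in column 1 and zeros in column 2. The plot shows a flat line.

I simply can't find a solution. I tried a very wide range of x values, with the same result.

What is the problem? What am I doing wrong?

Could the large numbers be the cause?

I think what is happening is that your data array is made up of integers, which leads to the following problem: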

>>> import numpy, scipy.stats
>>> 
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>> 
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
        44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

But if we use floats:

>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
         38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
         5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
         3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
         2.01305127e-05])
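The float conversion above can also be done explicitly before building the estimator. The following is a minimal sketch, not the asker's actual script: the `data` array here is a hypothetical stand-in (1000 random integers in the stated range), and `data_f`, `pad` and `total` are illustrative names. It also shows why the values look so small: a density spread over a range of about 56661 - 37231 = 19430 has to be on the order of 1/19430, roughly 5e-05, for the area under the curve to be 1.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for the asker's data: 1000 integers in the stated range.
rng = np.random.default_rng(0)
data = rng.integers(37231, 56661, size=1000)

# Convert explicitly so the KDE never sees an integer array.
data_f = np.asarray(data, dtype=float)

density = gaussian_kde(data_f)
x = np.linspace(data_f.min(), data_f.max(), 1000)
y = density(x)

# Sanity checks: the values are tiny but not zero, and the density
# integrates to roughly 1 over a range wide enough to cover the tails.
pad = 10 * data_f.std()
total = density.integrate_box_1d(data_f.min() - pad, data_f.max() + pad)
print(y.min(), y.max(), total)
```

If the printed total is close to 1, the small y values are just a consequence of the wide x range, not a sign that the estimate failed.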

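I made a function to do this. You can change the bandwidth through a parameter of the function: a smaller number gives a spikier curve, a larger number gives a smoother one. The default is 0.3.

It works in the IPython notebook --pylab=inline

The number of bins is computed in the code with an optimizing rule, so it varies with the number of values in the data.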

import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

def hist_with_kde(data, bandwidth=0.3):
    # set number of bins using the Freedman-Diaconis rule:
    # bin width = 2 * IQR / n**(1/3), bins = range / bin width
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)

    n = len(data) ** (1.0 / 3)   # cube root of the sample size
    rng = max(data) - min(data)
    iqr = 2 * (q3 - q1)          # twice the interquartile range
    bins = int((n * rng) / iqr)

    x = np.linspace(min(data), max(data), 200)

    # a scalar bw_method is used directly as the bandwidth factor
    # (public replacement for overriding covariance_factor by hand)
    kde = stats.gaussian_kde(data, bw_method=bandwidth)

    plt.plot(x, kde(x), 'r')  # distribution function
    plt.hist(data, bins=bins, density=True)  # histogram ('normed' in old matplotlib)

data = np.random.randn(500)
hist_with_kde(data, 0.25)
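To make the "smaller = spikier, larger = smoother" remark concrete, here is a small sketch that compares three bandwidth factors on the same sample without plotting. It passes the factor through `gaussian_kde`'s `bw_method` argument. The `total_variation` helper and the chosen factors (0.05, 0.3, 1.0) are illustrative assumptions, not part of the answer above.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.standard_normal(500)
x = np.linspace(data.min(), data.max(), 200)

def total_variation(y):
    # Sum of absolute jumps between neighbouring grid points;
    # a wigglier curve scores higher.
    return np.abs(np.diff(y)).sum()

results = {}
for bw in (0.05, 0.3, 1.0):
    kde = gaussian_kde(data, bw_method=bw)
    results[bw] = total_variation(kde(x))
    # kde.factor is the multiplier applied to the sample standard deviation
    print(bw, kde.factor, results[bw])
```

The wiggliness score should drop as the factor grows, which matches what the red curve does visually when you change the `bandwidth` argument.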

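Note the formatting error near the top; you can select all the code and click the {} button to add the necessary four spaces in front of each line.

@sarnold, sorry, which error do you mean? I actually used the {} button, and at least on my Mac the formatting looks fine. (I'm new here, but I apologize in advance for the mistake.)

@Proteos: Look at the first line, starting with "from scipy import...". It isn't marked as code.

Ha! That will teach me to give advice without looking at the source; you have to leave a blank line before the code. It's a bit silly, but you're right, the code is all there...

Oh, what a naive mistake! I thought I was missing something simple, but not this simple. On the other hand, the gaussian_kde() function should take care of the conversion to float, or at least give a warning that it needs floats. Don't you agree? OK, case solved! Thanks, everyone.

@Petros: I agree, this looks like a bug. Good catch.

I can see the argument both ways: if results are returned with as many decimal places as the input data, 1e-05 is very close to 0 for many purposes, especially when the inputs are this far from 0. It is certainly surprising, though.

This has been fixed in scipy; the next release (scipy 0.10) will convert to float.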
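For anyone who wants to confirm that their installed SciPy includes that fix, a quick check along these lines should do. It is a sketch that assumes a reasonably recent SciPy and NumPy; the data is randomly generated in the question's range and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data_int = rng.integers(37231, 56661, size=1000)   # integer dtype on purpose
x = np.linspace(data_int.min(), data_int.max(), 50)

# Same data, once passed as integers and once converted to float.
y_int = gaussian_kde(data_int)(x)
y_float = gaussian_kde(data_int.astype(float))(x)

print(y_int.dtype, np.allclose(y_int, y_float))
```

If the two results differ, or the integer version comes back as all zeros, the installed SciPy predates the fix and the explicit float conversion is still needed.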