Python Gaussian mixture model peak fitting (scikit-learn): how do I sample from a discrete pdf?


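As far as I understand, the Gaussian mixture models in scikit-learn expect samples drawn from a distribution that is composed of several Gaussians. For example:

They generate the samples with a gmm: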

from matplotlib import pyplot as plt
import numpy as np
from sklearn.mixture import GMM

np.random.seed(1)

gmm = GMM(3, n_iter=1)
gmm.means_ = np.array([[-1], [0], [3]])
gmm.covars_ = np.array([[1.5], [1], [0.5]]) ** 2
gmm.weights_ = np.array([0.3, 0.5, 0.2])

X = gmm.sample(1000)
In essence, they fit a Gaussian mixture model to the data, as in (pseudocode):

X looks like:

Histogram of X:

# `normed` was removed from newer matplotlib releases; `density` is its replacement
plt.hist(X, 30, density=True, histtype='stepfilled', alpha=0.4)
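For readers on a current scikit-learn: the `GMM` class used above was removed in version 0.20 and replaced by `GaussianMixture`, and the `normed` keyword of `plt.hist` has been replaced by `density`. The following is only a sketch of the same idea under those newer APIs. It draws the samples by hand with NumPy instead of setting attributes on an unfitted model, and the variable names (`means`, `sigmas`, `weights`, `component`) are chosen here for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Parameters of the three-component 1-D mixture from the snippet above.
means = np.array([-1.0, 0.0, 3.0])
sigmas = np.array([1.5, 1.0, 0.5])      # standard deviations
weights = np.array([0.3, 0.5, 0.2])

# Pick a component for every sample, then draw from that component's Gaussian.
component = rng.choice(3, size=1000, p=weights)
X = rng.normal(means[component], sigmas[component]).reshape(-1, 1)

# Fit a three-component mixture to the samples (this is the "peak fitting" step).
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Density-normalised histogram, i.e. the numbers plt.hist(..., density=True) draws.
density, edges = np.histogram(X, bins=30, density=True)
```

After fitting, `gmm.means_`, `gmm.covariances_` and `gmm.weights_` hold the estimated parameters. With only 1000 samples and strongly overlapping components they will not match the generating values exactly.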

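My data look different. I have not drawn samples from a distribution; the measuring device already gives me a discrete probability distribution (particle size):

How can I convert this discrete pdf into samples?

I tried: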

from scipy import stats
custm = stats.rv_discrete(name='custm', values=(x, y))
but
custm.pmf(xk)
does not reproduce the distribution.
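One likely reason is normalisation: `rv_discrete` expects probabilities that sum to 1, while the `y` values listed in this question sum to roughly 100, which suggests per-bin percentages. The sketch below shows one way to sample from such a table with plain NumPy. It uses a small synthetic two-peak curve as a stand-in for the instrument output (not the data from the question) and assumes each `y` value is the share of particles in the bin at `x`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the instrument output: log-spaced sizes and
# per-bin percentages with two overlapping peaks.
x = np.logspace(-2, 3.5, 101)
logx = np.log10(x)
y = (60 * np.exp(-0.5 * ((logx - 0.8) / 0.5) ** 2)
     + 40 * np.exp(-0.5 * ((logx - 2.0) / 0.25) ** 2))
y = 100 * y / y.sum()                 # percentages that sum to 100

# Normalise the percentages to a probability mass function.
p = y / y.sum()

# Draw bin indices with probability p, then map them to sizes.
idx = rng.choice(len(x), size=50_000, p=p)
samples = x[idx]

# Sanity check: relative frequencies of the drawn bins approximate p.
freq = np.bincount(idx, minlength=len(x)) / idx.size
```

Because the bins are spaced logarithmically, it is probably more sensible to fit a mixture to `np.log10(samples).reshape(-1, 1)` than to the raw sizes. Note also that every sample sits exactly on a bin centre, so the resampled data are only as fine as the original grid.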

My data:

x = np.array([  1.00074269e-02,   1.13692409e-02,   1.29163711e-02,
         1.46740353e-02,   1.66708829e-02,   1.89394623e-02,
         2.15167508e-02,   2.44447575e-02,   2.77712084e-02,
         3.15503239e-02,   3.58437026e-02,   4.07213258e-02,
         4.62626976e-02,   5.25581412e-02,   5.97102709e-02,
         6.78356648e-02,   7.70667650e-02,   8.75540365e-02,
         9.94684195e-02,   1.13004116e-01,   1.28381754e-01,
         1.45851987e-01,   1.65699574e-01,   1.88248029e-01,
         2.13864884e-01,   2.42967690e-01,   2.76030816e-01,
         3.13593183e-01,   3.56267049e-01,   4.04747990e-01,
         4.59826233e-01,   5.22399543e-01,   5.93487850e-01,
         6.74249878e-01,   7.66002029e-01,   8.70239845e-01,
         9.88662378e-01,   1.12319989e+00,   1.27604531e+00,
         1.44968999e+00,   1.64696429e+00,   1.87108375e+00,
         2.12570146e+00,   2.41496764e+00,   2.74359726e+00,
         3.11694690e+00,   3.54110209e+00,   4.02297646e+00,
         4.57042446e+00,   5.19236937e+00,   5.89894875e+00,
         6.70167970e+00,   7.61364654e+00,   8.64971415e+00,
         9.82677018e+00,   1.11640004e+01,   1.26832013e+01,
         1.44091357e+01,   1.63699358e+01,   1.85975622e+01,
         2.11283247e+01,   2.40034743e+01,   2.72698752e+01,
         3.09807691e+01,   3.51966426e+01,   3.99862137e+01,
         4.54275512e+01,   5.16093477e+01,   5.86323653e+01,
         6.66110774e+01,   7.56755354e+01,   8.59734879e+01,
         9.76727893e+01,   1.10964136e+02,   1.26064173e+02,
         1.43219028e+02,   1.62708322e+02,   1.84849725e+02,
         2.10004139e+02,   2.38581573e+02,   2.71047834e+02,
         3.07932115e+02,   3.49835622e+02,   3.97441372e+02,
         4.51525329e+02,   5.12969048e+02,   5.82774050e+02,
         6.62078141e+02,   7.52173958e+02,   8.54530045e+02,
         9.70814783e+02,   1.10292359e+03,   1.25300980e+03,
         1.42351980e+03,   1.61723286e+03,   1.83730645e+03,
         2.08732774e+03,   2.37137201e+03,   2.69406911e+03,
         3.06067895e+03,   3.47717718e+03])

y = np.array([ 0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.09,  0.18,  0.31,
        0.48,  0.69,  0.94,  1.21,  1.49,  1.76,  2.  ,  2.2 ,  2.36,
        2.47,  2.56,  2.63,  2.69,  2.76,  2.84,  2.91,  2.99,  3.05,
        3.1 ,  3.13,  3.14,  3.11,  3.06,  2.96,  2.81,  2.63,  2.42,
        2.21,  2.03,  1.94,  1.95,  2.09,  2.34,  2.66,  2.97,  3.18,
        3.22,  3.04,  2.64,  2.07,  1.43,  0.83,  0.36,  0.09,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ,
        0.  ,  0.  ])

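The data y appear to be the values of a function f at the points x. Its graph resembles the probability density function of a Gaussian mixture. Your problem is clearly different from estimating the parameters of a GM from data, because you have recorded pairs (x, y) rather than only x values as in gmm.sample(1000) above. Moreover, your data (x, y) do not appear to be sampled from a 2D GM.

It is the pdf of particle sizes measured by a laser diffraction device. The software computes the y values (the pdf) in some way. I do not have access to that software, only a general idea of how it works.

As far as I know, the parameters of a GM can (only) be estimated from samples. I think it would be best to ask, as a separate question, whether there is a way to estimate the parameters from a discretized version of the probability density function, as in your case.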
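One possible route when only the tabulated curve is available is to skip sampling altogether and fit a sum of Gaussian peaks to the (x, y) pairs by least squares. This is a different technique from the EM fitting that scikit-learn performs, and it is shown here only as a sketch: the two-peak model, the starting values `p0` and the synthetic curve are all assumptions made for illustration, and `t` stands for log10 of the particle size.

```python
import numpy as np
from scipy.optimize import curve_fit


def two_gaussians(t, a1, m1, s1, a2, m2, s2):
    """Sum of two unnormalised Gaussian peaks evaluated at t."""
    return (a1 * np.exp(-0.5 * ((t - m1) / s1) ** 2)
            + a2 * np.exp(-0.5 * ((t - m2) / s2) ** 2))


# Hypothetical noise-free curve on a log10(size) grid (not the question's data).
t = np.linspace(-2, 3.5, 101)
true_params = (3.0, 0.8, 0.5, 3.2, 2.0, 0.25)
y = two_gaussians(t, *true_params)

# Rough starting values read off the plot by eye; curve_fit needs them
# to converge to the intended peaks.
p0 = (2.5, 0.5, 0.4, 2.5, 2.2, 0.3)
fitted, _ = curve_fit(two_gaussians, t, y, p0=p0)

# The area under each peak is proportional to amplitude * sigma, so the
# mixture weights follow from the normalised areas.
areas = np.array([fitted[0] * abs(fitted[2]), fitted[3] * abs(fitted[5])])
weights = areas / areas.sum()
```

On real measurements the result depends on the starting values and on how many peaks are assumed, so it is worth plotting `two_gaussians(t, *fitted)` against the data before trusting the parameters.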