Python: getting different results with GMM


python scikit-learn


I want to cluster the classic iris dataset using GMM. I got the dataset from:

My program so far is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as mix
from sklearn.cross_validation import StratifiedKFold

def main():
    data=pd.read_csv("iris.csv",header=None)

    data=data.iloc[1:]

    data[4]=data[4].astype("category")

    data[4]=data[4].cat.codes

    target=np.array(data.pop(4))
    X=np.array(data).astype(float)


    kf=StratifiedKFold(target,n_folds=10,shuffle=True,random_state=1234)

    train_ind,test_ind=next(iter(kf))
    X_train=X[train_ind]
    y_train=target[train_ind]

    gmm_calc(X_train,"full",y_train)

def gmm_calc(X_train,cov,y_train):
    print X_train
    print y_train
    n_classes = len(np.unique(y_train))
    model=mix(n_components=n_classes,covariance_type="full")
    model.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in 
 xrange(n_classes)])
    model.fit(X_train)
    y_predict=model.predict(X_train)
    print cov," ",y_train
    print cov," ",y_predict
    print (np.mean(y_predict==y_train))*100

The problem I have is with counting the matches y_predict == y_train: every time I run the program I get a different result. For example:

First run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0.0

Second run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
33.33333333333333

Third run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
98.51851851851852

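So, as you can see, the result is different on every run. I found some code on the internet:

But there they get an accuracy of about 82% on the training set with full covariance. What am I doing wrong in this case?

Thanks

Update: I found that the internet example uses GMM rather than the new GaussianMixture. I also found that in that example the GMM parameters are initialized in a supervised way: classifier.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in xrange(n_classes)])

I have put the modified code above, but the result still changes every time I run it, which does not happen with the GMM library.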

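1) The GMM classifier fits a mixture of Gaussian models: the Gaussian components are initially centered at random on data points, and the algorithm then moves them until it converges to a local optimum. Because of the random initialization, the result can differ from run to run. So you also have to use the random_state parameter of GMM (or try setting a higher number of initializations, n_init, and expect more similar results).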
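The effect of random_state can be checked with a short sketch. It assumes the newer GaussianMixture class and, to stay self-contained, the iris copy bundled with scikit-learn rather than iris.csv; fit_predict is a hypothetical helper name used only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Assumption: the iris data bundled with scikit-learn stands in for iris.csv.
X = load_iris().data


def fit_predict(seed):
    """Fit a 3-component GMM with a fixed seed and return the cluster labels."""
    # random_state fixes the random part of the initialization;
    # n_init=10 keeps the best of ten initializations.
    model = GaussianMixture(n_components=3, covariance_type="full",
                            n_init=10, random_state=seed)
    return model.fit(X).predict(X)


labels_a = fit_predict(1234)
labels_b = fit_predict(1234)

# Two fits with the same seed should assign identical labels.
print(np.array_equal(labels_a, labels_b))
```

Fixing the seed makes the run repeatable, but it does not decide which cluster number ends up on which species; that is the subject of the next point.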

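2) The accuracy problem happens because GMM (same as kmeans) simply fits the Gaussians and reports the "number" of the Gaussian component each point belongs to; this number differs from run to run. You can see in your predictions that the clusters are the same but their labels are swapped: (1,2,0) -> (1,0,2) -> (0,1,2). The last combination coincides with the proper classes, so you get a 98% score. If you plot them, you can see that the Gaussians themselves tend to stay the same in this case. With that in mind, you can use a number of clustering metrics that do not depend on how the clusters are numbered: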

>>> from sklearn import metrics
>>> [round(i,5) for i in  (metrics.homogeneity_score(y_predict, y_train),
 metrics.completeness_score(y_predict, y_train),
 metrics.v_measure_score(y_predict,y_train),
 metrics.adjusted_rand_score(y_predict, y_train),
 metrics.adjusted_mutual_info_score(y_predict,  y_train))]
[0.86443, 0.8575, 0.86095, 0.84893, 0.85506]
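If a plain accuracy figure is still wanted, one option (an assumption here, not something the example above does) is to match cluster numbers to classes before comparing. The sketch below uses the Hungarian algorithm from SciPy on the confusion matrix; aligned_accuracy is a hypothetical helper, and the two label arrays are made-up toy data in which the grouping is nearly right but the numbering is permuted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix


def aligned_accuracy(y_true, y_pred):
    """Accuracy after pairing each class with its best-matching cluster."""
    cm = confusion_matrix(y_true, y_pred)
    # Hungarian algorithm: choose the one-to-one class/cluster pairing
    # that maximizes the number of matching points (hence the minus sign).
    rows, cols = linear_sum_assignment(-cm)
    return cm[rows, cols].sum() / float(len(y_true))


# Toy data for illustration only: same grouping, permuted numbers, one error.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 2, 2, 0, 0, 0, 0])

print(np.mean(y_true == y_pred))            # naive accuracy, hurt by the permutation
print(aligned_accuracy(y_true, y_pred))     # accuracy after matching the labels
print(adjusted_rand_score(y_true, y_pred))  # unaffected by how clusters are numbered
```

The matching step uses the true labels, so the resulting number is an evaluation aid rather than something a purely unsupervised model could produce by itself.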

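Code for the plot, taken from: note that the code differs between versions, and if you use an older version you need to replace the make_ellipses function: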

model = mix(n_components=len(np.unique(y_train)), covariance_type="full", verbose=0, n_init=100)
X_train = X_train.astype(float)
model.fit(X_train)
y_predict = model.predict(X_train)

import matplotlib as mpl
import matplotlib.pyplot as plt

def make_ellipses(gmm, ax):
    for n, color in enumerate(['navy', 'turquoise', 'darkorange']):
        if gmm.covariance_type == 'full':
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == 'tied':
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == 'diag':
            covariances = np.diag(gmm.covariances_[n][:2])
        elif gmm.covariance_type == 'spherical':
            covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
        v, w = np.linalg.eigh(covariances)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan2(u[1], u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        # Pass angle by keyword: newer matplotlib no longer accepts it positionally
        ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
                                  angle=180 + angle, color=color)
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)


def plot(model, X, y, y_predict):

    h = plt.subplot(1, 1, 1)
    plt.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                    left=.01, right=.99)
    make_ellipses(model, h)
    for n, color in enumerate( ['navy', 'turquoise', 'darkorange']):
        plt.scatter(X[y == n][:,0], X[y == n][:,1],  color=color,marker='x')
    # Draw the accuracy label once, after all classes have been plotted
    plt.text(0.05, 0.9, 'Accuracy: %.1f' % ((np.mean(y_predict == y)) * 100),
             transform=h.transAxes)

    plt.show()
plot(model, X_train, y_train, y_predict)

It's very late for your question, but this might be useful to others.

As @hellpanderr posted, use "random_state=1" in GMM.
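For the newer GaussianMixture class, the supervised initialization mentioned in the question's update can be combined with a fixed seed. In that class, fit() recomputes means_, so a value assigned to model.means_ beforehand is overwritten; initial means are passed through the means_init constructor parameter instead. A minimal sketch, assuming the iris copy bundled with scikit-learn instead of iris.csv and fitting on the whole dataset rather than one fold:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Assumption: bundled iris data, all 150 samples, instead of a fold of iris.csv.
iris = load_iris()
X, y = iris.data, iris.target
n_classes = len(np.unique(y))

# Per-class means computed from the labels (supervised initialization).
class_means = np.array([X[y == i].mean(axis=0) for i in range(n_classes)])

# means_init is read by fit(); random_state fixes the remaining randomness.
model = GaussianMixture(n_components=n_classes, covariance_type="full",
                        means_init=class_means, random_state=1)
y_predict = model.fit(X).predict(X)

accuracy = np.mean(y_predict == y) * 100
print(accuracy)
```

Starting component i at the mean of class i tends to keep the cluster numbers lined up with the class codes, so the direct comparison y_predict == y becomes meaningful. It is not a guarantee, since EM is still free to move the components.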

Thanks @hellpanderr. Sorry to ask, but could you add the code for the plot you showed?