Kmeans with initial centroids gives different outputs in Matlab and Python environments


The input to Kmeans in both the Matlab and Python environments is as follows:

input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64]

Matlab:

[idx, C] = kmeans(input',3,'Start',[0.3;0.9;1.5]);

Output:

C = [0.596, 0.825, 1.035]

(idx==1) = 15, (idx==2) = 6, (idx==3) = 6

Python:

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=1, init=np.array([0.3,0.9,1.5]).reshape(-1,1)).fit(np.array(input).reshape(-1, 1))
idx = kmeans.labels_
C = kmeans.cluster_centers_

Output:

C = [0.430, 0.969, 0.637]

(idx==0) = 2, (idx==1) = 10, (idx==2) = 15

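Clearly, the output centroids and the number of input points assigned to each of the 3 clusters differ between the two environments. What is the reason for this, given that the initial centroids are the same?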

I wrote a minimal kmeans algorithm to test your dataset with matlab:

input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, ...
         0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, ...
         0.64];
c = [0.3; 0.9; 1.5]
for ii = 1:10
    [~, idx] = min(abs(c - input));         % pairwise Euclidean distance
    c = accumarray(idx', input, [], @mean)  % compute the new centroids
end

After the first iteration, the index idx, which indicates the closest centroid for each value, looks like this:

 2   2   2   2   2   2   2   1   1   2...

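The last centroid (1.5 here) is never the closest one to any value! So, in order to keep 3 groups, the kmeans algorithm has to compute a new value for this centroid somehow (since it is hard to compute the mean of an empty set). It looks like python and matlab have different implementations for that.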
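To make the empty-cluster effect concrete, here is a minimal plain-Python sketch of the first assignment and update step. The two `empty_policy` options are purely hypothetical illustrations of how an implementation could treat an empty cluster; neither is claimed to be what Matlab or scikit-learn actually does. The list is called `data` here only to avoid shadowing Python's built-in `input`.

```python
# Same values as in the question.
data = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74,
        0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62,
        0.62, 0.64, 0.98, 0.90, 0.64]


def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points]


def update(points, labels, centroids, empty_policy):
    """One centroid update; `empty_policy` decides what happens to an empty cluster."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(sum(members) / len(members))
        elif empty_policy == "keep":
            # Hypothetical policy A: leave the centroid where it was.
            new_centroids.append(old)
        else:
            # Hypothetical policy B: move the centroid onto the point that is
            # farthest from the centroid it is currently assigned to.
            farthest = max(zip(points, labels),
                           key=lambda pl: abs(pl[0] - centroids[pl[1]]))
            new_centroids.append(farthest[0])
    return new_centroids


init = [0.3, 0.9, 1.5]
labels = assign(data, init)
counts = [labels.count(j) for j in range(len(init))]
print("points per cluster:", counts)  # the last entry is 0: nothing is closest to 1.5

kept = update(data, labels, init, "keep")
reseeded = update(data, labels, init, "farthest")
print("third centroid, policy A:", kept[2])
print("third centroid, policy B:", reseeded[2])
```

The two policies agree on the first two centroids but place the third one differently, so every later iteration starts from a different state. That is enough for two otherwise deterministic implementations to end up with different final centroids.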

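If you want to avoid this problem, make sure that each initial centroid is the closest one to at least one element of your dataset.

For example, you can take the first three distinct values of your dataset.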
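A small sketch of that check, in plain Python. The helper names `empty_clusters` and `first_distinct` are made up for this illustration and are not part of any library.

```python
# Same values as in the question (named `data` to avoid shadowing input()).
data = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74,
        0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62,
        0.62, 0.64, 0.98, 0.90, 0.64]


def empty_clusters(points, centroids):
    """Indices of the centroids that are not the nearest centroid of any point."""
    used = {min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points}
    return [j for j in range(len(centroids)) if j not in used]


def first_distinct(points, k):
    """First k distinct values of `points`, in order of appearance."""
    seen = []
    for p in points:
        if p not in seen:
            seen.append(p)
        if len(seen) == k:
            return seen
    raise ValueError("fewer than k distinct values in the dataset")


bad_init = [0.3, 0.9, 1.5]
good_init = first_distinct(data, 3)

print(empty_clusters(data, bad_init))   # lists the centroid(s) that start out empty
print(good_init, empty_clusters(data, good_init))
```

Because each centroid returned by `first_distinct` is itself a data point, it is at distance zero from at least one element, so no cluster is empty after the first assignment.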

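This is most likely due to differences in implementation; there are several common ones.

In the post you mentioned it is stated: "K-means is only randomized in its starting centers. Once the initial candidate centers are determined, it is deterministic after that point." Since I have provided the initial centroid positions, Kmeans should run only once and produce the same centroid output. I assumed that randomization only happens when choosing the initial centroids.

Thanks @obchardon for the clarification. You are absolutely right, the third centroid is never selected. I tried the initial centroids C = [min(input); (min(input)+max(input))/2; max(input)]; and indeed, the output centroids are the same in the Matlab and Python environments.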
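The min / midpoint / max initialization from the last comment can be tried with a bare-bones Lloyd iteration in plain Python. This is only a sketch under the assumption that no cluster ever becomes empty (it raises an error if one does); it is not the code that either Matlab or scikit-learn runs.

```python
# Same values as in the question (named `data` to avoid shadowing input()).
data = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74,
        0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62,
        0.62, 0.64, 0.98, 0.90, 0.64]


def lloyd(points, centroids, max_iter=100):
    """Plain 1-D Lloyd iteration: assign, recompute means, stop when nothing changes."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                raise RuntimeError(f"cluster {j} became empty")
            new_centroids.append(sum(members) / len(members))
        if new_centroids == centroids:
            break  # converged: the assignment no longer changes the means
        centroids = new_centroids
    return centroids, labels


lo, hi = min(data), max(data)
init = [lo, (lo + hi) / 2, hi]
centroids, labels = lloyd(data, init)
counts = [labels.count(j) for j in range(len(init))]
print(centroids, counts)
```

With this start, the smallest and largest centroids each own at least the minimum and maximum data point, so the empty-cluster branch is never reached on this dataset and the whole run is fully determined by the initial centroids.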