R k-means: average silhouette value exceeds the permissible range of -1 to +1

r binary cluster-analysis k-means

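I am trying to solve a clustering problem in R where all of the independent variables are binary. I have only a basic knowledge of R. Using R code that attempts to carry out the steps listed below, I observe that the silhouette coefficient exceeds its permissible range for some of the initial iterations. A snapshot of this is attached.

The steps are:

1. Compute a distance matrix containing the dissimilarity between every pair of records. Function: vegdist() from package: vegan
2. Use the distance matrix with k-means and run k-means several times, say for k from 1 to 12. Function: kmeansruns() from package: fpc
3. Capture the average silhouette width (asw) for each iteration and identify the best iteration, i.e. the one that yields the maximum silhouette.
4. Cross-validate this "k" (found in step 3) to judge the stability of the clusters, using just 100 iterations and bootstrap samples.

In the plot of k (X axis) against asw (Y axis), I find inconsistent average silhouette values: [k_vs_asw.jpeg]

Can anyone help with what might be going wrong here? Or should some other clustering algorithm be used?

The code and sample data for this analysis are attached:

Code: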

> ###############################################
>
> library(vegan)
> library(fpc)
> library(reshape2)
> library(ggplot2)
>
> dist <- vegdist(mydat2, method = "jaccard")
> clustering.asw <- kmeansruns(dist, krange = 1:12, criterion = "asw")
> clustering.asw$bestk
>
> critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
>
> critframe <- melt(critframe, id.vars = c("k"),
>                   variable.name = "measure", value.name = "score")
>
> ggplot(critframe, aes(x=k, y=score, color=measure)) +
>   geom_point(aes(shape=measure)) + geom_line(aes(linetype=measure)) +
>   scale_x_continuous(breaks=1:12, labels=1:12)
>
> summary(clustering.asw)
>
> kbest.p <- 2
>
> cboot <- clusterboot(dist, clustermethod = kmeansCBI, runs = 100,
>                      iter.max = 100, krange=kbest.p, seed = 12345)
> groups <- cboot$result$partition
>
> print(cboot$result$partition, kbest.p)
>
> cboot$bootmean
>
> cboot$bootbrd
>
> ####################################################

Sample data:

ID  V1  V2  V3  V4  V5
 1   0   1   0   1   0
 2   0   1   0   0   1
 3   0   0   0   0   0
 4   1   0   0   1   0
 5   1   0   1   1   0
 6   0   1   0   0   0
 7   0   0   0   0   0
 8   0   0   0   0   1
 9   0   0   1   0   0
10   0   1   0   1   0
11   0   0   0   0   0
12   1   0   0   0   1
13   1   0   0   0   0
14   1   1   0   0   0
15   0   0   0   0   0
16   0   0   0   0   0
17   0   0   0   0   0
18   0   0   1   1   0
19   0   0   0   1   1
20   0   1   0   1   0

There are 40 such binary columns and roughly 350+ observations.

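k-means cannot use a distance matrix. It only works with squared Euclidean distance (and with the equivalent of Euclidean distance in certain kernel spaces, where the kernel preserves the mean).
It computes point-to-mean distances, not point-to-point distances. A distance matrix is therefore useless to it.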
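To make this concrete, here is a minimal sketch of what seems to happen when a `dist` object is handed to base R's `kmeans()`. The toy data and variable names are invented for illustration, and `dist(..., method = "binary")` (the Jaccard distance on 0/1 data in base R) is used as a stand-in for `vegdist(..., method = "jaccard")` purely to avoid extra packages.

```r
# Toy binary data (hypothetical, for illustration only).
set.seed(1)
x <- matrix(rbinom(30 * 5, 1, 0.4), nrow = 30,
            dimnames = list(NULL, paste0("V", 1:5)))

# Pairwise Jaccard-type dissimilarities between the 30 rows.
d <- dist(x, method = "binary")

# kmeans() calls as.matrix() on its input, so the dist object becomes a
# 30 x 30 table and each row is treated as a 30-dimensional point.
km <- kmeans(d, centers = 2, nstart = 10)
dim(km$centers)  # centroids have 30 coordinates, not 5

# Each centroid is simply the column mean of the rows in its cluster,
# i.e. k-means is still averaging coordinates, not using d as a distance.
m <- as.matrix(d)
manual <- colMeans(m[km$cluster == 1, , drop = FALSE])
all.equal(unname(manual), unname(km$centers[1, ]))
```

In other words, the dissimilarities end up being used as if they were feature columns, and the Euclidean point-to-mean machinery runs on top of them.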
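Nevertheless, the silhouette should lie in [-1, +1], so there is an error in the code you are using. Look into the code; do not treat it as a black box.
The error is in the following line, where scale() centers and rescales the asw values to z-scores, so the plotted numbers are no longer silhouette widths: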

critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
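Dropping the `scale()` call, i.e. `data.frame(k = 1:12, asw = clustering.asw$crit)`, should keep the plotted values on the silhouette scale. The sketch below illustrates the effect on made-up data; it assumes the `cluster` package (a recommended package shipped with standard R installations) for `silhouette()`, and it runs k-means on the raw 0/1 columns with Euclidean distance rather than on a distance matrix. All data and names are hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(42)
x <- matrix(rbinom(60 * 8, 1, 0.3), nrow = 60)  # hypothetical 0/1 data
d_euc <- dist(x)                                # Euclidean, matches k-means

# Average silhouette width (ASW) for k = 2..8.
ks <- 2:8
asw <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 20, iter.max = 50)
  mean(silhouette(km$cluster, d_euc)[, "sil_width"])
})

range(asw)  # raw ASW values stay inside [-1, 1]

# scale() turns them into z-scores (mean 0, sd 1), which are not bounded
# by 1 in absolute value and are no longer silhouettes.
z <- as.numeric(scale(asw))
range(z)

# Plot-ready frame with the unscaled values.
critframe <- data.frame(k = ks, asw = asw)
```

The z-scores can still be useful for overlaying several criteria on one plot, but the axis then has to be read as "standard deviations from the mean", not as a silhouette.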

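Agreed that k-means should not be used with dissimilarities. Would k-medoids help in this case? As I understand it, it picks representative observations from the data itself as cluster centers instead of using randomly assigned cluster centers. Or should hierarchical clustering work?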
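Both alternatives mentioned in this comment accept a dissimilarity matrix directly. The following is a rough sketch under stated assumptions, not a recommendation for this particular data set: it uses `pam()` and `silhouette()` from the `cluster` package, base R's `hclust()`, and `dist(..., method = "binary")` as a stand-in for the Jaccard dissimilarity. The data, the range of k, and the average-linkage choice are all hypothetical.

```r
library(cluster)  # pam() for k-medoids, silhouette() for evaluation

set.seed(7)
x <- matrix(rbinom(80 * 10, 1, 0.3), nrow = 80)  # hypothetical 0/1 data
d <- dist(x, method = "binary")                  # Jaccard-type dissimilarity

# k-medoids on the dissimilarities; choose k by average silhouette width.
ks   <- 2:6
fits <- lapply(ks, function(k) pam(d, k = k, diss = TRUE))
asw  <- sapply(fits, function(f) f$silinfo$avg.width)
best_k <- ks[which.max(asw)]
best   <- fits[[which.max(asw)]]

best$id.med             # row indices of the medoids (actual observations)
table(best$clustering)  # cluster sizes

# Hierarchical alternative on the same dissimilarities, cut at best_k.
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = best_k)
hc_asw <- mean(silhouette(groups, d)[, "sil_width"])
hc_asw
```

Because both methods work from the same `d`, their silhouette widths are directly comparable, which gives a simple way to check whether either one finds clearer structure.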