Weighted k-means in R

r machine-learning

I want to run k-means clustering on a dataset (i.e. Sample_Data) with three variables (columns), as shown below:

     A  B  C
1    12 10 1
2    8  11 2
3    14 10 1
.    .   .  .
.    .   .  .
.    .   .  .

Normally, after scaling the columns and deciding on the number of clusters, I would use this function in R:

Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)

You have to use weighted k-means clustering, as provided in the flexclust package:

Function:
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)

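Perform k-means clustering, hard competitive learning or neural gas on a data matrix.

weights
An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.

A toy example using the iris data:
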
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl  
    kcca object of family ‘kmeans’ 

    call:
    cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)

    cluster sizes:

     1  2  3 
    50 59 41 

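If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It is trivial to show that this increases the variable's weight in the SSQ optimization objective.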
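
The claim above can be written out in one step. The notation here is an assumption made for illustration: x_{ij} is the value of variable j for observation i, C_k is the set of observations in cluster k, \mu_{kj} is the centroid coordinate of cluster k for variable j, and c_j is the constant applied to column j. Because a centroid is the mean of its members, multiplying column j by c_j also multiplies \mu_{kj} by c_j, so:

```latex
\mathrm{SSQ} = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \left( x_{ij} - \mu_{kj} \right)^2
\qquad\Longrightarrow\qquad
\mathrm{SSQ}_c = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} \left( c_j x_{ij} - c_j \mu_{kj} \right)^2
             = \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j=1}^{p} c_j^2 \left( x_{ij} - \mu_{kj} \right)^2
```

In other words, a factor c_j enters the objective as a weight of c_j^2. To give variable j a weight of w_j in the objective, multiply the column by sqrt(w_j).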
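
I had the same problem, and the answers here were not satisfying to me.
What we all want is observation-weighted k-means clustering in R. A readable example of our problem is the following link:
However, the solution using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are described in detail above and in the package documentation.
I looked at many sites but did not find any solution or package in R for running the "standard" k-means algorithm with weighted observations. I also wonder why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share.
So basically you have two options. First, rewrite the flexclust algorithm to enable weights within the standard approach. Or second, estimate weighted cluster centroids to use as starting centroids and run the standard k-means algorithm for a single iteration, then compute new weighted cluster centroids and run k-means for one iteration again, and so on until convergence is reached.
I used the second option because it was the easier one for me. I used the data.table package; I hope you are familiar with it.
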
rm(list=ls())

library(data.table)

### gen dataset with sample-weights
dataset     <- data.table(iris)
dataset[, weights:= rep(c(1, 0.7, 0.3, 4, 5),30)] 
dataset[, Species := NULL]


### initial hclust for estimating weighted centroids
clustering    <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'), 
                        method = 'ward.D2')
no_of_clusters <- 4


### estimating starting centroids (weighted)
weighted_centroids  <- matrix(NA, nrow = no_of_clusters, 
                              ncol =  ncol(dataset[, c(1:4)]))
initial_cluster <- cutree(clustering, k = no_of_clusters)
for (i in seq_len(no_of_clusters)) {
  in_cluster <- initial_cluster == i
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][in_cluster, ],
                                    weighted.mean,
                                    w = dataset[in_cluster, weights])
}


### performing weighted k-means as explained in my post
iter            <- 0 
cluster_i       <- 0
cluster_iminus1 <- 1

## while loop: if number of iteration is smaller than 50 and cluster_i (result of 
## current iteration) is not identical to cluster_iminus1 (result of former 
## iteration) then continue
while (!identical(cluster_i, cluster_iminus1) && iter < 50) {

  # update iteration  
  iter <- iter + 1

  # k-means with weighted centroids and one iteration (may generate warning messages 
  # as no convergence is reached); iter.max is spelled out instead of relying on
  # partial argument matching
  cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids, iter.max = 1)$cluster

  # estimating new weighted centroids
  weighted_centroids <- matrix(NA, nrow = no_of_clusters, 
                               ncol=ncol(dataset[,c(1:4)]))
  # use the assignment from the current k-means step, not the initial hclust partition
  for (i in seq_len(no_of_clusters)) {
    in_cluster <- cluster_kmeans == i
    weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][in_cluster, ],
                                      weighted.mean,
                                      w = dataset[in_cluster, weights])
  }

  # update cluster_i and cluster_iminus1
  if(iter == 1) {cluster_iminus1 <- 0} else{cluster_iminus1 <- cluster_i}
  cluster_i <- cluster_kmeans

}


## merge final clusters to data table
dataset[, cluster := cluster_i]
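
The same idea (assign each observation to its nearest centroid, then recompute each centroid as a weighted mean) can also be sketched as a small base R function, which avoids calling kmeans() with a single iteration. This is only a minimal sketch under stated assumptions: the function name `weighted_kmeans`, its arguments and the random starting rows are hypothetical choices made for illustration, not part of any package or of the answer above.

```r
# Observation-weighted k-means (plain assign/update iterations) in base R.
# x: numeric matrix, w: one non-negative weight per row,
# centers: k x p matrix of starting centroids (hypothetical interface).
weighted_kmeans <- function(x, w, centers, max_iter = 50) {
  x <- as.matrix(x)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # assignment step: squared Euclidean distance of every row to every centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- unname(apply(d2, 1, which.min))
    # stop as soon as the assignment no longer changes
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # update step: weighted mean of the members of each cluster;
    # an empty cluster keeps its previous centroid
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) {
        centers[j, ] <- colSums(x[members, , drop = FALSE] * w[members]) / sum(w[members])
      }
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# usage with the same sample weights as above and four random starting rows
x <- as.matrix(iris[, 1:4])
w <- rep(c(1, 0.7, 0.3, 4, 5), 30)
set.seed(1)
start <- x[sample(nrow(x), 4), , drop = FALSE]
fit <- weighted_kmeans(x, w, start)
table(fit$cluster)
```

Like ordinary k-means, this only finds a local optimum, so the result depends on the starting centroids; the weighted hclust centroids computed earlier could be passed as `centers` instead of random rows.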

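Thanks, but it mentions that the weights only work in combination with hard competitive learning. Is that the same as k-means?

Do you know a manual way of inserting weights into the model? For example, for a weighted mean we can either use the command in R or do the calculation by hand. Do you know the logic behind inserting weights into a k-means model?

Thank you for your kindness and the explanation. As a last question (hopefully): is there any constraint on the weights? For example sum(weights) = 1 or anything like that? Can you point me to a reference so I can look more deeply into how to assign weights to variables?

The function has no constraints on the weights. You can find an application here:

Thanks. At what stage should this be done? After scaling or before scaling? Can you name a reference?

Scaling = weighting. So don't blindly apply some random scaling function you found in some example. Choose appropriate weights instead. There is no reference for this; it is obvious from the objective function that you need to choose an appropriate scale for each attribute.

Suppose we have three variables: monetary value (ranging from 1000 to 10^6), frequency (ranging from 1 to 10) and recency (ranging from 1 to 250). Do you think I should not scale them? Or what is the way to find an appropriate scale?

If you want to weight them, scaling is redundant. Just choose smaller weights for the variables that would otherwise dominate the result. But do it in a deliberate way. Scaling them to [0; 1] first has no benefit; it just does the same thing twice. Note that when you have axes of such different scales, the result will usually be pretty useless. The SSQ objective then usually lacks any real relevance, and your clustering optimizes a useless quantity.

I believe I have to do more research. By the way, thanks for your help.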
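
To make the "scaling = weighting" point from the comments concrete, here is a small sketch on simulated data with the three ranges mentioned above. Everything in it is an assumption made for illustration: the data are random uniform draws, and the per-variable factors in `scale_factor` are arbitrary values chosen so that no column dominates, not recommended weights.

```r
set.seed(42)
n <- 300

# simulated data with the ranges from the comment (illustration only)
rfm <- data.frame(
  monetary  = runif(n, 1000, 1e6),
  frequency = runif(n, 1, 10),
  recency   = runif(n, 1, 250)
)

# hypothetical per-variable factors: one multiplication per column is both
# the "scaling" and the "weighting" step
scale_factor <- c(monetary = 1 / 1e5, frequency = 1, recency = 1 / 25)
rfm_weighted <- sweep(as.matrix(rfm), 2, scale_factor, `*`)

# share of the total variance contributed by each column; the SSQ objective
# is driven by the columns with the largest share
share_raw      <- apply(rfm, 2, var) / sum(apply(rfm, 2, var))
share_weighted <- apply(rfm_weighted, 2, var) / sum(apply(rfm_weighted, 2, var))
round(share_raw, 4)
round(share_weighted, 4)

# clustering on the raw columns is effectively clustering on monetary alone,
# while the weighted version lets all three columns contribute
fit_raw      <- kmeans(rfm, centers = 3, nstart = 20)
fit_weighted <- kmeans(rfm_weighted, centers = 3, nstart = 20)
```

Changing a single entry of `scale_factor` is how a variable would be given more or less influence; a separate rescaling to [0; 1] beforehand would only be folded into the same factor.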