How can I find the optimal number of clusters in hierarchical clustering using the gap statistic in Python?


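I want to run hierarchical clustering with single linkage to cluster documents with 300 features and 1,500 observations. I want to find the optimal number of clusters for this problem.

The tutorial I found uses the following code to find the number of clusters with the largest gap: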

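# Compute gap statistic
set.seed(123)
iris.scaled ...

The hcut() function is part of the factoextra package used in the link you posted:

hcut                package:factoextra                R Documentation

Computes Hierarchical Clustering and Cut the Tree

Description: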

 Computes hierarchical clustering (hclust, agnes, diana) and cut
 the tree into k clusters. It also accepts correlation based
 distance measure methods such as "pearson", "spearman" and
 "kendall".

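R also has a built-in function, hclust(), that can be used to perform hierarchical clustering. However, it does not perform single-linkage clustering by default (its default method is "complete"), so you cannot simply replace hcut with hclust.

If you look at the help for clusGap(), however, you will see that you can supply a custom clustering function to apply: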

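FUNcluster: a function which accepts as first argument a (data) matrix like "x", second argument, say k, k >= 2, the number of clusters desired, and returns a "list" with a component named (or shortened to) "cluster" which is a vector of length "n = nrow(x)" of integers in "1:k" determining the clustering or grouping of the "n" observations.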

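The hclust() function is able to perform single-linkage hierarchical clustering when called with method="single", so you can do the following: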

cluster_fun <- function(x, k) list(cluster=cutree(hclust(dist(x), method="single"), k=k))
gap_stat <- clusGap(iris.scaled, FUN=cluster_fun, K.max=10, B=50)
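
Since the question asks about Python, the same idea can be sketched there as well. The block below is a minimal, standard-library-only illustration, not a port of clusGap(): the function names (single_linkage_cuts, log_within_dispersion, gap_statistic, choose_k) are hypothetical, the reference data are drawn uniformly from the bounding box of the input, and W_k uses squared Euclidean distances, whereas clusGap() defaults to d.power = 1, so the numbers will not match the R output exactly. In practice, scipy.cluster.hierarchy.linkage(X, method="single") together with fcluster(Z, t=k, criterion="maxclust") would likely replace the hand-rolled clustering helper.

```python
import math
import random


def single_linkage_cuts(points, k_max):
    """Return {k: labels} for k = 1..k_max by cutting a single-linkage tree.

    Merging the closest pairs first (Kruskal-style) yields the same
    partitions as cutting a single-linkage dendrogram at k clusters.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def snapshot():
        ids = {}
        return [ids.setdefault(find(i), len(ids)) for i in range(n)]

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    cuts, clusters = {}, n
    if clusters <= k_max:
        cuts[clusters] = snapshot()
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1
            if clusters <= k_max:
                cuts[clusters] = snapshot()
    return cuts


def log_within_dispersion(points, labels):
    """log(W_k), with W_k the pooled sum of squared distances to centroids."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    total = 0.0
    for members in groups.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return math.log(total)  # assumes total > 0, i.e. k well below n


def gap_statistic(points, k_max=10, b=50, seed=123):
    """Return (gaps, errors) for k = 1..k_max; requires len(points) >= k_max."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]

    def curve(data):
        cuts = single_linkage_cuts(data, k_max)
        return [log_within_dispersion(data, cuts[k]) for k in range(1, k_max + 1)]

    observed = curve(points)
    # Assumption: uniform reference data over the bounding box of the input.
    reference = [
        curve([[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)] for _ in points])
        for _ in range(b)
    ]
    gaps, errors = [], []
    for idx in range(k_max):
        ref_k = [run[idx] for run in reference]
        mean = sum(ref_k) / b
        sd = math.sqrt(sum((v - mean) ** 2 for v in ref_k) / b)
        gaps.append(mean - observed[idx])
        errors.append(sd * math.sqrt(1 + 1 / b))
    return gaps, errors


def choose_k(gaps, errors):
    """Smallest k with gap(k) >= gap(k+1) - s(k+1); else the k with the max gap."""
    for idx in range(len(gaps) - 1):
        if gaps[idx] >= gaps[idx + 1] - errors[idx + 1]:
            return idx + 1
    return max(range(len(gaps)), key=gaps.__getitem__) + 1


if __name__ == "__main__":
    demo_rng = random.Random(0)
    # Two synthetic 2-D blobs, purely for illustration.
    data = [[demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3)] for _ in range(30)]
    data += [[demo_rng.gauss(5, 0.3), demo_rng.gauss(5, 0.3)] for _ in range(30)]
    gaps, errors = gap_statistic(data, k_max=5, b=10)
    print("gaps:", [round(g, 3) for g in gaps])
    print("chosen k:", choose_k(gaps, errors))
```

The helper builds all pairwise distances, so it is quadratic in the number of observations. For 1,500 documents that is still feasible, but each of the B reference sets repeats the work, so a smaller B or a library implementation may be preferable. Single linkage also tends to peel off outliers rather than split the data evenly, which can make the gap curve flat. It is worth inspecting the whole curve rather than relying only on the selected k.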
