Hierarchical clustering with Gower distance: hclust() and philentropy::distance() (R, cluster analysis)

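I have a mixed data set (categorical and continuous variables) and I would like to run hierarchical clustering on it using the Gower distance.

My code is based on an example that uses base R dist() for the Euclidean distance. Since dist() does not compute the Gower distance, I tried philentropy::distance() to compute it, but it doesn't work.

Thanks for your help!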

# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)

# Hierarchical clustering with Euclidean distance - works 
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)

# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

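The error is in the distance function itself.

I don't know whether it is intentional or not, but the current implementation of philentropy::distance with the "gower" method cannot handle mixed data types. Its first operation is to transpose the data.frame, which produces a character matrix, and that matrix throws a type error when it is passed to the DistMatrixWithoutUnit function.

You can try the daisy function from the cluster package instead: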
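library(cluster)

# Keep mpg (numeric) and cyl; store cyl as a factor so it is treated as categorical
x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# daisy() accepts mixed column types when metric = "gower"
gower_d <- daisy(x, metric = "gower")

# The result can be passed straight to hclust()
cls <- hclust(gower_d)
plot(cls)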
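
To make it clearer what a Gower dissimilarity is for this two-column case, here is a minimal base R sketch. It assumes no missing values and equal weights for both columns. The names d_mpg, d_cyl and gower_manual are made up for illustration. As far as I understand, this is the same quantity a Gower implementation computes in this simple setting, but treat that as an assumption to check rather than a guarantee.

```r
# Two-column example from the question: mpg (numeric) and cyl (categorical)
x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# Numeric column: absolute difference divided by the column's range
d_mpg <- as.matrix(dist(x$mpg, method = "manhattan")) / diff(range(x$mpg))

# Categorical column: 0 if the levels match, 1 otherwise
d_cyl <- outer(as.character(x$cyl), as.character(x$cyl), FUN = "!=") * 1

# Gower dissimilarity (illustrative): unweighted mean over the two columns
gower_manual <- (d_mpg + d_cyl) / 2
dimnames(gower_manual) <- list(rownames(x), rownames(x))

# hclust() needs a dist object, so convert the square matrix first
clusters_manual <- hclust(as.dist(gower_manual))
```

Calling plot(clusters_manual) then draws the dendrogram with the car names as labels.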

Sorry, I don't speak English well and cannot explain it. This is just an attempt, but the code is good ;-)
library(philentropy)
clusters

With the gower package you can do this very efficiently:
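library(gower)

# gower_dist() compares row i with every row of mtcars; sapply() collects the
# 32 result vectors as the columns of a 32 x 32 matrix
d <- sapply(seq_len(nrow(mtcars)), function(i) gower_dist(mtcars[i, ], mtcars))
d <- as.dist(d)  # convert the square matrix to a dist object for hclust()
h <- hclust(d)
plot(h)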

Thank you very much for this great question, and thanks to everyone who provided excellent answers.
To resolve the issue for future readers:
# import example data
data("mtcars")
# store an example subset with every column converted to double
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)),
                                cyl = as.numeric(as.vector(mtcars$cyl)),
                                disp = as.numeric(as.vector(mtcars$disp)))

# transpose so that each variable (mpg, cyl, disp) becomes a row,
# to conform with the philentropy input format
mtcars_subset <- t(mtcars_subset)

# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)

# When using the developer version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
                                                 use.row.names = TRUE)))
plot(clusters)

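To work around the fact that philentropy::distance() cannot handle mixed data types, I stored the columns of interest from the mtcars dataset in a separate data.frame/tibble and converted all of them to double values via as.numeric(as.vector(mtcars$mpg)).
The resulting subset now stores only double values, as required. Printed before the transpose, it looks like this:
mtcars_subset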

# A tibble: 32 x 3
     mpg   cyl  disp
   <dbl> <dbl> <dbl>
 1  21       6  160 
 2  21       6  160 
 3  22.8     4  108 
 4  21.4     6  258 
 5  18.7     8  360 
 6  18.1     6  225 
 7  14.3     8  360 
 8  24.4     4  147.
 9  22.8     4  141.
10  19.2     6  168.
# … with 22 more rows

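Also note that if you pass only two input vectors to philentropy::distance(), only a single distance value is returned, and hclust() cannot compute any clusters from a single value. That is why I added a third column, disp, so that the clusters can be visualized.
I hope this helps.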
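
A side note on the transpose step: the shape of the input decides what ends up being clustered. The sketch below uses base R dist() purely as a stand-in. It rests on the assumption, taken from the answer above, that philentropy::distance() also compares the rows of its input. The names m, d_cars and d_vars are made up for illustration.

```r
# Same three numeric columns as in the answer above
m <- as.matrix(mtcars[, c("mpg", "cyl", "disp")])

# Rows = cars: 32 observations give a 32 x 32 distance matrix
d_cars <- as.matrix(dist(m))
dim(d_cars)

# Rows = variables (after t()): 3 rows give a 3 x 3 distance matrix,
# so a dendrogram built from it groups mpg, cyl and disp, not the cars
d_vars <- as.matrix(dist(t(m)))
dim(d_vars)
rownames(d_vars)
```

So with the transposed input the dendrogram has three leaves, one per variable. If the goal is to group the cars themselves, the observations need to be the rows that are compared.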
The philentropy::distance function returns a matrix; try converting it to a dist object with as.dist before clustering.
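
A minimal base R sketch of that conversion, in case it helps. The square matrix here is produced with dist() and as.matrix() only as a stand-in for whatever function returned a plain distance matrix, and the variable names are made up for illustration.

```r
# Stand-in for a function that returns a plain square matrix of distances
m <- as.matrix(dist(mtcars[, c("mpg", "disp")]))
class(m)

# hclust() expects a "dist" object, so convert the matrix first
d <- as.dist(m)
class(d)

clusters <- hclust(d)
```

as.dist() keeps only the lower triangle of the matrix, so it is meant for symmetric distance matrices like this one.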