R PAM clustering: using the results on another data set

r machine-learning

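I have successfully run a partitioning around medoids with the pam function (cluster package in R), and now I would like to use the results to assign new observations to the previously defined clusters/medoids.

Another way to state the problem: given the k clusters/medoids that pam has found, which one is closest to an additional observation that was not in the initial data set?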

x<-matrix(c(1,1.2,0.9,2.3,2,1.8,
            3.2,4,3.1,3.9,3,4.4),6,2)
x
     [,1] [,2]
[1,]  1.0  3.2
[2,]  1.2  4.0
[3,]  0.9  3.1
[4,]  2.3  3.9
[5,]  2.0  3.0
[6,]  1.8  4.4
library(cluster)
pam(x,2)

Now, to which cluster/medoid should y be assigned?

y<-c(1.5,4.5)

In general, for k clusters, try the following:
library(cluster)

k <- 2 # pam with k clusters
res <- pam(x, k)

y <- c(1.5, 4.5) # new point

# get the index of the medoid (cluster) closest to the new point,
# using squared Euclidean distance;
# which.min breaks ties by returning the first of the closest medoids

# non-vectorized function: loops over the medoids
# (the number of medoids is taken from res instead of the global k)
get.cluster1 <- function(res, y) which.min(sapply(seq_len(nrow(res$medoids)), function(i) sum((res$medoids[i,]-y)^2)))

# vectorized function, faster: t(res$medoids) has one column per medoid
get.cluster2 <- function(res, y) which.min(colSums((t(res$medoids)-y)^2))

get.cluster1(res, y)
#[1] 2
get.cluster2(res, y)
#[1] 2

# comparing the two implementations (the vectorized function takes roughly half the time)
library(microbenchmark)
microbenchmark(get.cluster1(res, y), get.cluster2(res, y))

#Unit: microseconds
#                 expr    min     lq     mean median     uq     max neval cld
# get.cluster1(res, y) 31.219 32.075 34.89718 32.930 33.358 135.995   100   b
# get.cluster2(res, y) 17.107 17.962 19.12527 18.817 19.245  41.483   100  a 
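
The same nearest-medoid rule extends to several new observations at once. The following is a minimal sketch rather than part of the original answer: `assign.to.medoids` is a hypothetical helper name, and it assumes `pam` was run on a data matrix with its default Euclidean metric, so that `res$medoids` holds the medoid coordinates.

```r
library(cluster)

x <- matrix(c(1, 1.2, 0.9, 2.3, 2, 1.8,
              3.2, 4, 3.1, 3.9, 3, 4.4), 6, 2)
res <- pam(x, 2)

# hypothetical helper: for each row of newdata, return the index of the
# nearest medoid (squared Euclidean distance; ties go to the first medoid)
assign.to.medoids <- function(medoids, newdata) {
  # treat a plain vector as a single observation
  if (is.null(dim(newdata))) newdata <- matrix(newdata, nrow = 1)
  stopifnot(ncol(newdata) == ncol(medoids))
  apply(newdata, 1, function(p) which.min(colSums((t(medoids) - p)^2)))
}

new.points <- rbind(c(1.5, 4.5), c(0.8, 3.0), c(2.2, 3.5))
assign.to.medoids(res$medoids, new.points)   # one cluster index per row
assign.to.medoids(res$medoids, c(1.5, 4.5))  # a single point also works
```

Passing the medoid matrix instead of the whole `pam` object keeps the helper usable with any set of reference points.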

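You can compute the distance from each medoid to y and see which distance is smallest. y will belong to that cluster.

You don't need a library for which.min and the distance computation. Just write that one line of code yourself!

Note that this code works for Euclidean distance only, but you would not normally use pam with Euclidean distance.

@Anony-Mousse We can use any distance function in place of the Euclidean distance.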

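# distance functions
euclidean.func <- function(x, y) sqrt(sum((x-y)^2))
manhattan.func <- function(x, y) sum(abs(x-y))

# same lookup with a pluggable distance function
# (the number of medoids is taken from res instead of the global k)
get.cluster3 <- function(res, y, dist.func=euclidean.func) which.min(sapply(seq_len(nrow(res$medoids)), function(i) dist.func(res$medoids[i,], y)))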
get.cluster3(res, y) # use Euclidean as default
#[1] 2
get.cluster3(res, y, manhattan.func) # use Manhattan distance
#[1] 2
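
As the comments point out, the distance used for the assignment should match the dissimilarity used when fitting. Here is a minimal sketch under that assumption: `pam` is fitted with `metric = "manhattan"` and the new point is assigned with the same metric. `nearest.medoid` is a hypothetical helper, not a function of the cluster package.

```r
library(cluster)

x <- matrix(c(1, 1.2, 0.9, 2.3, 2, 1.8,
              3.2, 4, 3.1, 3.9, 3, 4.4), 6, 2)

# fit with Manhattan distance instead of the default Euclidean metric
res.man <- pam(x, 2, metric = "manhattan")

manhattan.func <- function(a, b) sum(abs(a - b))

# hypothetical helper: index of the medoid closest to y under dist.func
nearest.medoid <- function(medoids, y, dist.func) {
  d <- apply(medoids, 1, dist.func, y)  # distance from each medoid to y
  which.min(d)                          # ties go to the first medoid
}

y <- c(1.5, 4.5)
nearest.medoid(res.man$medoids, y, manhattan.func)
```

This relies on `res.man$medoids` containing coordinates, which is the case when `pam` receives the data matrix. If `pam` is given a precomputed dissimilarity object instead, `medoids` holds observation numbers or labels, and the coordinates have to be looked up in the original data first.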