R: What is the best way to select the most dissimilar individuals from a population?


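I tried to use k-means clustering to select the most diverse markers in my population. For example, to select 100 lines, I cluster the whole population into 100 clusters and then select, from each cluster, the marker closest to the centroid.

The problem with my solution is that it takes too much time (my function probably needs optimization), especially when the number of markers exceeds 100000.

So I would greatly appreciate it if anyone could show me a new way to select markers that maximizes diversity in my population, and/or help me optimize my function so that it runs faster.

Thank you all.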

# example:

library(BLR)
data(wheat)
dim(X)
mdf<-mostdiff(t(X), 100,1,nstart=1000)

Here is the mostdiff function I used:

mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
    transposedMarkers <- as.array(markers)
    mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
    save(mrkClust, file="markerCluster.Rdata")

    # within clusters, pick the markers that are closest to the cluster centroid
    # turn the vector of which markers belong to which clusters into a list nClust long
    # each element of the list is a vector of the markers in that cluster

    clustersToList <- function(nClust, clusters) {
        # lapply always returns a list, even when every cluster has the same size
        lapply(seq_len(nClust), function(whichClust) which(clusters == whichClust))
    }

    pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
        clustSize <- length(vecOfCluster)
        # if there are fewer than three markers, the center is equally distant from all so don't bother
        if (clustSize < 3) return(vecOfCluster[1:min(pickHowMany, clustSize)])

        # figure out the distance (squared) between each marker in the cluster and the cluster center
        distToCenter <- function(marker, center){
            diff <- center - marker    
            return(sum(diff*diff))
        }

        dists <- apply(transposedMarkers[vecOfCluster,], 1, distToCenter, center=centers[whichClust,])
        return(vecOfCluster[order(dists)[1:min(pickHowMany, clustSize)]]) 
    }
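
    # Apply the helpers: list the markers in each cluster, then keep the
    # nMrkPerClust markers nearest to each cluster centroid
    clusters <- clustersToList(nClust, mrkClust$cluster)
    picked <- lapply(seq_len(nClust), function(k) {
        pickCloseToCenter(clusters[[k]], k, transposedMarkers, mrkClust$centers, nMrkPerClust)
    })
    return(unlist(picked))
}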

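Although I would argue that the slowest part of your code is actually kmeans, you can try something like the code below. With large datasets you could also consider, depending on the shape of your data, reducing the nstart parameter or subsetting.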

library(plyr)

markers <- data.frame(x=rnorm(1e6), y=rnorm(1e6), z=rnorm(1e6))

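mostdiff <- function(markers, nClust=100, iter.max=1e5) {
    ncols <- ncol(markers)

    km <- kmeans(markers, nClust, iter.max=iter.max)

    markers$cluster <- km$cluster
    # squared distance from each row to the centroid of its own cluster
    # (the trailing comma in the index selects whole centroid rows)
    markers$d <- rowSums(
        (as.matrix(markers[, 1:ncols]) - km$centers[markers$cluster, , drop=FALSE])^2
    )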

    result <- subset(
        merge(
            ddply(markers, ~cluster, summarise, d=min(d)),
            markers,
            all.x=TRUE, all.y=FALSE
        ),
        select=-c(d, cluster)
    )

    return(result)
}

mostdiff(markers, 100)

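If you are looking for outliers in the population, and do not necessarily need markers to identify them, I suggest using the Mahalanobis distance. It is usually the tool of choice for outlier identification.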

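# x is assumed to be a numeric matrix with one row per individual
k <- 1000 # Number of outliers from the population we want
n <- nrow(x)
ma.dist <- mahalanobis(x, colMeans(x), cov(x))
ix <- order(ma.dist) # row indices, from smallest to largest distance
mdf <- x[ix[(n - k + 1):n], ] # the k rows farthest from the population mean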

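If kmeans is the most time-consuming part, you could apply the k-means algorithm to a random subset of your population. If the random subset is still large compared to the number of centroids you choose, you will get roughly the same results. Alternatively, you could run kmeans on several subsets and merge the results.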
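The subset idea can be sketched as follows. This is only an illustration under stated assumptions: the toy matrix, the subset size of 1000 and the variable names (subsetRows, d2, selected) are made up for the example and are not from the question. The centroids are fitted on a random subset, and then the closest row of the full data set is looked up for each centroid.

```r
set.seed(1)
# Hypothetical toy data: 5000 markers (rows) measured on 20 individuals (columns)
markers <- matrix(rnorm(5000 * 20), nrow = 5000)
nClust <- 50

# Fit k-means on a random subset of the rows only
subsetRows <- sample(nrow(markers), 1000)
km <- kmeans(markers[subsetRows, ], centers = nClust, nstart = 5, iter.max = 50)

# Squared distance from every row to every centroid, using
# |x - c|^2 = |x|^2 + |c|^2 - 2 x.c  (a markers-by-clusters matrix)
d2 <- outer(rowSums(markers^2), rowSums(km$centers^2), "+") -
    2 * markers %*% t(km$centers)

# For each centroid, keep the row of the full data set that lies closest to it
selected <- apply(d2, 2, which.min)
```

The distance matrix has one column per centroid, so for 100000 markers and 100 clusters it holds about 1e7 numbers, which is usually still manageable in memory.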

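Another option is to try the k-medoids algorithm, which gives centroids that are themselves members of the population, so the second step of finding the member of each cluster closest to its centroid is no longer needed. It might be slower than k-means, though.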
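For large marker sets, one way to combine both suggestions (subsampling and medoids) might be clara() from the cluster package, which runs the medoid search on repeated subsamples. The sketch below uses made-up toy data, and the values of k, samples and sampsize are arbitrary choices for illustration, not recommendations.

```r
library(cluster)

set.seed(1)
# Hypothetical toy data: 20000 markers (rows) measured on 10 individuals (columns)
markers <- matrix(rnorm(20000 * 10), nrow = 20000)

# clara() draws `samples` subsamples of `sampsize` rows, runs a medoid search
# on each, and keeps the set of medoids that fits the whole data set best
cl <- clara(markers, k = 25, samples = 10, sampsize = 500)

# Row indices of the chosen medoids; these are actual rows of `markers`
selected <- cl$i.med
```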

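In case someone else is trying to do the same thing, here is an answer based on damienfrancois's suggestion. Besides working on the raw data, pam (k-medoids) also lets us supply our own distance matrix, which is very important when the marker data contain many missing values.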

library(BLR)

data(wheat)

library(cluster)

pam_out<-pam(t(X),100)

selec.markers<-as.data.frame(colnames(X)[pam_out$id.med])
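The distance-matrix route mentioned above could look roughly like this. It is a sketch on made-up data (the toy genotype matrix, the 5% missing rate and the names markerDist and selected.markers are assumptions for illustration). It relies on dist() ignoring missing pairs and rescaling the remaining sum, so another dissimilarity measure may suit real marker data better.

```r
library(cluster)

set.seed(1)
# Hypothetical toy data: 30 individuals (rows) by 200 markers (columns), coded 0/1
X <- matrix(rbinom(30 * 200, 1, 0.5), nrow = 30,
            dimnames = list(NULL, paste0("m", 1:200)))
X[sample(length(X), 300)] <- NA   # set 5% of the genotypes to missing

# dist() skips NA pairs and scales up the remaining sum; the result has no NA
# as long as every pair of markers shares at least one observed individual
markerDist <- dist(t(X))

# pam() treats a "dist" object as a dissimilarity matrix
pam_out <- pam(markerDist, k = 10)

# id.med holds the positions of the medoids, i.e. the selected markers
selected.markers <- colnames(X)[pam_out$id.med]
```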

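Please fix the code formatting. It is really hard to read right now. Also, the closing bracket of mostdiff is missing.

Sorry, I have fixed it; this is only the second time I have asked a question this way.

Removing the absurd amount of white space was certainly part of my editing plan, but I intended to leave some indentation in. Deeply nested functions really hinder understanding of this code.

OK, I give up editing... Thank you for improving my post.

@zero323: Thank you very much for your solution, but every time I run your code on the real dataset X, it produces a matrix with different dimensions. We only need a list of the 100 distinct individuals.

@damienfrancois: After reading more about k-medoids, I will try to use it. I do not know how yet, but I will do my best. Thank you in advance.

You can add library(cluster) at the top of your script and replace the line mrkClust

@damienfrancois: I tried to upvote your answer, but I am not allowed to, because I need 15 reputation to do that.

@AaronD I think you should be able to accept it by clicking the V sign. Also, when you address someone in a comment, you should prefix the name with @, as I did at the beginning of this comment.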