如何在kproto函数中实现jaccard距离

如何在kproto函数中实现jaccard距离,r,cluster-analysis,R,Cluster Analysis,我试图在R中的kproto函数包clustMixType中实现jaccard的距离,但没有成功。目的是对我的数据集进行聚类分析 我想使用的jaccard距离是jaccard相似系数的补码,所以 distance of jaccard = 1-[a/(a+b+c)] = [(b+c)/(a+b+c)], or distance of jaccard = 1-[M11/(M01+M10+M11)] = [(M01+M10)/(M01+M10+M11)]. kproto函数的源代码如下所示。变量d1

我试图在R中的kproto函数包clustMixType中实现jaccard的距离,但没有成功。目的是对我的数据集进行聚类分析

我想使用的jaccard距离是jaccard相似系数的补码,所以

distance of jaccard = 1-[a/(a+b+c)] = [(b+c)/(a+b+c)], or
distance of jaccard = 1-[M11/(M01+M10+M11)] = [(M01+M10)/(M01+M10+M11)].
kproto函数的源代码如下所示。变量d1是数值变量的欧氏距离,变量d2是分类变量作为因子与简单匹配系数之间的距离

它计算观测值和原型之间的距离,而不是观测值之间的距离。原型是计算出来的,而不是观察数据集本身

所以我的两个问题是 1 d2是我想要修改的,但是如何修改? 2 d1是否应为所计算内容的sqrt

谢谢你的帮助。这将是非常宝贵的

下面是我正在处理的数据集的摘录,其中V1到V4是使用jaccard距离的因子二进制变量,V5到V8是使用欧几里德距离的数字变量:

V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
1;1;0;0;1;3;2;1
1;1;0;0;4;3;3;1
1;1;1;0;1;4;1;1
1;0;1;0;4;3;1;1
1;1;0;0;2;4;2;1
1;1;0;0;2;4;2;1
1;1;0;0;6;2;1;1
1;1;0;0;6;2;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;3;1
1;1;0;0;4;4;2;1
1;1;0;0;4;3;3;1
1;1;0;0;4;3;3;1
1;1;0;0;3;4;2;1
1;1;0;0;4;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;4;1
1;1;0;0;4;3;2;1
1;1;0;0;4;3;2;1
1;1;0;0;3;3;2;1
1;1;0;0;3;3;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;1;0;0;3;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;0;0;0;3;4;2;1
1;1;0;0;7;2;4;1
1;1;0;0;7;2;2;1
1;1;0;0;5;2;4;1
1;1;0;0;5;3;4;1
1;1;0;0;5;3;2;1
1;1;0;0;5;3;4;1
1;0;0;0;3;5;3;1
0;1;0;0;6;2;4;1
1;1;0;0;6;2;3;1
1;1;0;0;6;2;4;1
让我们以上面提供的数据集中的前两个观测值为例:

V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
该算法首先从数据集中随机选择k个原型,从而保证第二个观测值是一个初始原型。据我所知,该算法最初创建了一个名为protos的数据帧,其中包含来自所提供数据集的k个随机观测值,因此第二个观测值将是proto数据帧的一行

用于对观测值进行聚类的组合距离为d=d1+lambda*d2。Lambda也可以是每个变量的单个权重向量。d是所提供数据集中的观测值与最初使用k个随机观测值创建的原型矩阵之间的距离

在这种情况下,考虑到前两个观测值,观测值yi和原型yk之间的计算距离如下:

Euclidian for the numeric variables (V5 to V8):
 d1=sum[(yij-ykj)^2]^0,5
 where,
    k=1 to k clusters
    i=1 to n observations
    j=5 to 8 th variable
 d1=[[(6-3)^2]+[(2-5)^2]+[(3-2)^2]+[(3-1)^2]]^0,5
 d1=[9+9+1+4]^0,5
 d1=4.796

Jaccard, for the set of binary variables (V1 to V4):
 d2=[(b+c)/(a+b+c)]
 where,
    a=1
    b=1
    c=1
    are correspondences counts between the n observations and the k prototypes, for variables 1 to 4.
 d2=[(1+1)/(1+1+1)]
 d2=2/3
 d2=0.667
因此,这一特殊观测与该星团初始原型之间的组合距离为:

d=d1+d2
d=4.796+0.667
d=5.463
据我所知,结果随后被存储在一个称为d的矩阵中,逐行存储,大小为[行数=观察数,列数=簇数k]

我希望正确计算欧几里得距离和雅卡距离,修改kproto函数,保持原始函数提供的步骤和结果

注意:最后一个函数应该处理任意数量的观察、变量和原型,而不仅仅是我的特定数据集

我也尝试过混合kproto clustMixType包和dist.binary ade4包中的代码,但它们的工作方式不同

#K-Prototypes algorithm
kproto.default <- function(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE, keep.data = TRUE, verbose = TRUE, ...){

  # initial error checks
  if(!is.data.frame(x)) stop("x should be a data frame!")
  if(ncol(x) < 2) stop("For clustering x should contain at least two variables!")
  if(iter.max < 1 | nstart < 1) stop("iter.max and nstart must not be specified < 1!")
  if(!is.null(lambda)){
    if(any(lambda < 0)) stop("lambda must be specified >= 0!")
    if(!any(lambda > 0)) stop("lambda must be specified > 0 for at least one variable!")
    }
  # check for numeric and factor variables
  numvars <- sapply(x, is.numeric)
  anynum <- any(numvars)
  catvars <- sapply(x, is.factor)
  anyfact <- any(catvars)
  if(!anynum) stop("\n No numeric variables in x! Try using kmodes() from package klaR...\n\n")
  if(!anyfact) stop("\n No factor variables in x! Try using kmeans()...\n\n")

  # treatment of missings
  NAcount <- apply(x, 2, function(z) sum(is.na(z)))
  if(verbose){
    cat("# NAs in variables:\n")
    print(NAcount)
  }
  if(any(NAcount == nrow(x))) stop(paste("Variable(s) have only NAs please remove them:",names(NAcount)[NAcount == nrow(x)],"!"))
  if(na.rm) {
    miss <- apply(x, 1, function(z) any(is.na(z)))
    if(verbose){
      cat(sum(miss), "observation(s) with NAs.\n")
      if(sum(miss) > 0) message("Observations with NAs are removed.\n")
      cat("\n")
    } 
    x <- x[!miss,]
    } # remove missings

  if(!na.rm){
    allNAs <- apply(x,1,function(z) all(is.na(z)))
    if(sum(allNAs) > 0){
      if(verbose) cat(sum(allNAs), "observation(s) where all variables NA.\n")
      warning("No meaningful cluster assignment possible for observations where all variables NA.\n")
      if(verbose) cat("\n")

    }
  }

  if(nrow(x) == 1) stop("Only one observation clustering not meaningful.")

  k_input <- k # store input k for nstart > 1 as clusters can be merged 

  # initialize prototypes
  if(!is.data.frame(k)){
    if (length(k) == 1){
      if(as.integer(k) != k){k <- as.integer(k); warning(paste("k has been set to", k,"!"))}
      if(nrow(x) < k) stop("Data frame has less observations than clusters!")
      ids <- sample(nrow(x), k)
      protos <- x[ids,]
    }
    if (length(k) > 1){
      if(nrow(x) < length(k)) stop("Data frame has less observations than clusters!")
      ids <- k
      k <- length(ids)
      if(length(unique(ids)) != length(ids)) stop("If k is specified as a vector it should contain different indices!")
      if(any(ids<1)|any(ids>nrow(x))) stop("If k is specified as a vector all elements must be valid indices of x!")
      #check for integer
      protos <- x[ids,]
    }
    rm(ids)
  }
  if(is.data.frame(k)){
    if(nrow(x) < nrow(k)) stop("Data frame has less observations than clusters!")
    if(length(names(k)) != length(names(x))) stop("k and x have different numbers of columns!")
    if(any(names(k) != names(x))) stop("k and x have different column names!")
    if(anynum) {if( any(sapply(k, is.numeric) != numvars)) stop("Numeric variables of k and x do not match!")}
    if(anyfact) {if( any(sapply(k, is.factor) != catvars)) stop("Factor variables of k and x do not match!")}
    protos <- k
    k <- nrow(protos)
  }
  if(k < 1) stop("Number of clusters k must not be smaller than 1!")

  # automatic calculation of lambda
  if(length(lambda) > 1) {if(length(lambda) != sum(c(numvars,catvars))) stop("If lambda is a vector, its length should be the sum of numeric and factor variables in the data frame!")}
  if(is.null(lambda)){
    if(anynum & anyfact){
      vnum <- mean(sapply(x[,numvars, drop = FALSE], var, na.rm = TRUE))
      vcat <- mean(sapply(x[,catvars, drop = FALSE], function(z) return(1-sum((table(z)/sum(!is.na(z)))^2))))
      if (vnum == 0){
        if(verbose) warning("All numerical variables have zero variance.")
        anynum <- FALSE
      } 
      if (vcat == 0){
        if(verbose) warning("All categorical variables have zero variance.")
        anyfact <- FALSE
      } 
      if(anynum & anyfact){
        lambda <- vnum/vcat
        if(verbose) cat("Estimated lambda:", lambda, "\n\n")
      }else{
        lambda <- 1
      }
    }
  }

  # initialize clusters
  clusters  <- numeric(nrow(x)) 
  tot.dists <- NULL
  moved   <- NULL
  iter <- 1

  # check for any equal prototypes and reduce cluster number in case of occurence
  if(k > 1){
    keep.protos <- rep(TRUE,k)
    for(l in 1:(k-1)){
      for(m in (l+1):k){
        d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
        d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics 
        if((d1+d2) == 0) keep.protos[m] <- FALSE 
      }
    }
    if(!all(keep.protos)){
      protos <- protos[keep.protos,]
      k <- sum(keep.protos)
      if(verbose) message("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")      
    }
  }

  # special case only one cluster
  if(k == 1){clusters <- rep(1, nrow(x)); size  <- table(clusters); iter <- iter.max} # REM: named vector size is needed later...

  # start iterations for standard case (i.e. k > 1)
  while(iter < iter.max){

    # compute distances 
    nrows <- nrow(x)
    dists <- matrix(NA, nrow=nrows, ncol = k)
    for(i in 1:k){
      #a0 <- proc.time()[3]      
      #d1 <- apply(x[,numvars],1, function(z) sum((z-protos[i,numvars])^2)) # euclidean for numerics
      d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
      if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
      if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
      #a1 <- proc.time()[3]      
      #d2 <- lambda * apply(x[,catvars],1, function(z) sum((z != protos[i,catvars]))) # wtd simple matching for categorics 
      d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
      d2[is.na(d2)] <- FALSE
      if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
      if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
      #a2 <- proc.time()[3]      
      dists[,i] <- d1 + d2
      #cat(a1-a0, a2-a1, "\n")
    }

    # assign clusters 
    old.clusters  <- clusters
    # clusters      <- apply(dists, 1, function(z) which.min(z))
    clusters      <- apply(dists, 1, function(z) {a <- which.min(z); if (length(a)>1) a <- sample(a,1); return(a)}) # sample in case of multiple minima
    size          <- table(clusters)  
    min.dists     <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
    within        <- as.numeric(by(min.dists, clusters, sum))
    tot.within    <- sum(within)
    # prevent from empty classes
    #tot.within    <- numeric(k)
    #totw.list     <- by(min.dists, clusters, sum) 
    #tot.within[names(totw.list)] <- as.numeric(totw.list)

    # ...check for empty clusters and eventually reduce number of prototypes    
    if (length(size) < k){
      k <- length(size)
      protos <- protos[1:length(size),]  
      if(verbose) cat("Empty clusters occur. Cluster number reduced to:", k, "\n\n")
    }

    # trace
    tot.dists <- c(tot.dists, sum(tot.within))      
    moved <- c(moved, sum(clusters != old.clusters))

    # compute new prototypes
    remids <- as.integer(names(size))
    for(i in remids){
      protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
      protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
    }

    if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}

    # check for any equal prototypes and reduce cluster number in case of occurence
    if(iter == (iter.max-1)){ # REM: for last iteration equal prototypes are allowed. otherwise less prototypes than assigned clusters.
      keep.protos <- rep(TRUE,k)
      for(l in 1:(k-1)){
        for(m in (l+1):k){
          d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
          d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics 
          if((d1+d2) == 0) keep.protos[m] <- FALSE 
        }
      }
      if(!all(keep.protos)){
        protos <- protos[keep.protos,]
        k <- sum(keep.protos)
        if(verbose) cat("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")      
      }
    }

    # add stopping rules
    if(moved[length(moved)] ==  0) break

    if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}

    #cat("iter", iter, "moved", moved[length(moved)], "tot.dists",tot.dists[length(tot.dists)],"\n" )      
    iter <- iter+1
  }


  ### Final update of prototypes and dists
  if(iter == iter.max){ # otherwise there have been no moves anymore and prototypes correspond to cluster assignments 
    # compute new prototypes
    remids <- as.integer(names(size))
    for(i in remids){
      protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
      protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
    }

    # compute distances 
    nrows <- nrow(x)
    dists <- matrix(NA, nrow=nrows, ncol = k)
    for(i in 1:k){
      d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
      if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
      if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
      d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
      d2[is.na(d2)] <- FALSE
      if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
      if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
      dists[,i] <- d1 + d2
    }

    size          <- table(clusters)  
    min.dists     <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
    within        <- as.numeric(by(min.dists, clusters, sum))
    tot.within    <- sum(within)
  }


  names(clusters) <- row.names(dists) <- row.names(x)
  rownames(protos) <- NULL
  # create result: 
  res <- list(cluster = clusters,  
              centers = protos, 
              lambda = lambda, 
              size = size,
              withinss = within,
              tot.withinss = tot.within,   
              dists = dists, 
              iter = iter, 
              trace = list(tot.dists = tot.dists, moved = moved))

  # loop: if nstart > 1:
  if(nstart > 1)
    for(j in 2:nstart){
      res.new <- kproto(x=x, k=k_input, lambda = lambda,  iter.max = iter.max, nstart=1, verbose=verbose)
      if(res.new$tot.withinss < res$tot.withinss) res <- res.new
    }  

  if(keep.data) res$data = x
  class(res) <- "kproto"
  return(res)
}

我已经设法修改了函数,以接受各种相似性度量,并将R文件上载到,以备有人需要。

请详细说明您的计算结果好吗?Jaccard索引是在集合之间计算的,而欧几里德距离是在点之间计算的。我很难计算出V1-V4计算自定义Jaccard索引所需的计算,也无法使用V5-V8计算欧几里德距离,因为它是4个值。你能提供一个计算的例子和你的数据的小样本所需要的输出吗?非常感谢你的答复。我会尽量说清楚的。我将在问题中添加您所问的内容,以便更容易理解。NM uu请您检查我在问题中所做的更改,好吗?我尽量说清楚。多谢各位@我不确定你是否知道如何通知别人。像@nmOh一样使用@username,我不知道如何正确地通知他们。谢谢@mnm