R: subsetting a data frame where the number of matching variables is k

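For example, I want to subset a data frame on the condition that the number of matching variables between rows equals a given number.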

example <- rbind(sample(letters[1:5]),
             sample(letters[1:5]),
             sample(letters[1:5]),
             sample(letters[1:5]),
             sample(letters[1:5]))


example

     [,1] [,2] [,3] [,4] [,5]
[1,] "b"  "a"  "d"  "e"  "c" 
[2,] "e"  "c"  "a"  "d"  "b" 
[3,] "c"  "a"  "d"  "b"  "e" 
[4,] "b"  "d"  "e"  "c"  "a" 
[5,] "b"  "c"  "e"  "d"  "a"

Creating reproducible example data:
set.seed(47)
example <- rbind(sample(letters[1:5]),
                 sample(letters[1:5]),
                 sample(letters[1:5]),
                 sample(letters[1:5]),
                 sample(letters[1:5]))

example
#    [,1] [,2] [,3] [,4] [,5]
#[1,] "e"  "b"  "c"  "d"  "a" 
#[2,] "d"  "b"  "e"  "c"  "a" 
#[3,] "a"  "c"  "e"  "b"  "d" 
#[4,] "e"  "b"  "a"  "c"  "d" 
#[5,] "a"  "c"  "b"  "e"  "d" 

The rows selected from example:
#     [,1] [,2] [,3] [,4] [,5]
#[1,] "a"  "c"  "e"  "b"  "d"
#[2,] "a"  "c"  "b"  "e"  "d"

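Here we compare each row element-wise with every row and count the number of equal comparisons, checking whether that count is equal to or greater than the threshold (n). The other loop filters out the comparison of a row with itself.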
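The comparison described above can be sketched as follows. This is an illustrative reconstruction, not necessarily the code the answer used: the helper name `matching_rows` and its arguments are hypothetical, and the matrix is typed in by hand from the seeded output above so the result does not depend on the R version's sampler.

```r
# Hard-coded copy of the seeded example matrix shown above.
example <- rbind(c("e", "b", "c", "d", "a"),
                 c("d", "b", "e", "c", "a"),
                 c("a", "c", "e", "b", "d"),
                 c("e", "b", "a", "c", "d"),
                 c("a", "c", "b", "e", "d"))

# Hypothetical helper: indices of rows that agree with at least one
# *other* row in n or more positions.
matching_rows <- function(mat, n) {
  keep <- vapply(seq_len(nrow(mat)), function(i) {
    # Skip the self-comparison, which would always match in every position.
    others <- setdiff(seq_len(nrow(mat)), i)
    # Count element-wise matches between row i and each other row j.
    any(vapply(others, function(j) sum(mat[i, ] == mat[j, ]) >= n,
               logical(1)))
  }, logical(1))
  which(keep)
}

idx <- matching_rows(example, 3)
idx                            # row indices to keep
example[idx, , drop = FALSE]   # the matching rows themselves
```

With the matrix above and a threshold of 3, only rows 3 and 5 qualify, since they agree in positions 1, 2 and 5.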
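Another approach is to use combn twice: the first call enumerates the row pairs, and the second performs the pairwise comparisons.
Using Ronak Shah's example: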
combn(seq_len(nrow(example)), 2)[, combn(seq_len(nrow(example)), 2,
                                 FUN=function(x) sum(example[x[1],] == example[x[2],]) >= 3)]
[1] 3 5

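This indicates which rows to keep.
In general this returns a matrix, and row indices may be repeated. For example, with the threshold set to 2 we get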
     [,1] [,2] [,3] [,4]
[1,]    1    1    2    3
[2,]    2    4    4    5
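The one-liner above can be unpacked into named steps to make the two combn calls easier to follow. This is only a sketch: the variable names `pairs`, `hits` and `matched` are introduced here for illustration, and the matrix is again typed in by hand from the seeded output.

```r
# Hard-coded copy of the seeded example matrix shown above.
example <- rbind(c("e", "b", "c", "d", "a"),
                 c("d", "b", "e", "c", "a"),
                 c("a", "c", "e", "b", "d"),
                 c("e", "b", "a", "c", "d"),
                 c("a", "c", "b", "e", "d"))
thresh <- 2

# Step 1: enumerate all unordered row pairs, one pair per column.
pairs <- combn(seq_len(nrow(example)), 2)

# Step 2: for each pair, test whether the two rows agree in at least
# thresh positions. The result is a logical vector, one entry per pair.
hits <- combn(seq_len(nrow(example)), 2,
              FUN = function(x) sum(example[x[1], ] == example[x[2], ]) >= thresh)

# Step 3: keep only the pairs that passed the test. drop = FALSE keeps the
# result a matrix even when a single pair matches.
matched <- pairs[, hits, drop = FALSE]
matched
```

Each column of `matched` is one pair of row indices, which is why an index such as 1, 2 or 4 can appear more than once.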

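To turn this into something useful, use c to convert the result to a vector and then unique to remove the duplicated row indices. While we are at it, we may as well wrap the whole process in a function that lets the threshold be chosen.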
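rowKeeper <- function(myMat, thresh) {
   # Enumerate all row pairs, keep the pairs that agree in at least thresh
   # positions, then flatten them to a vector of unique row indices.
   myMat[unique(c(combn(seq_len(nrow(myMat)), 2)[,
         combn(seq_len(nrow(myMat)), 2,
               FUN=function(x) sum(myMat[x[1],] == myMat[x[2],]) >= thresh)])),]
}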
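To see what the c and unique steps contribute, the threshold-2 case can be worked through by hand. The pair matrix below is copied from the output shown earlier; the names `pairs`, `flat` and `keep` are illustrative only, and the final sort is an optional extra that the function above does not perform.

```r
# The threshold-2 pair matrix from above, typed in by hand.
pairs <- rbind(c(1, 1, 2, 3),
               c(2, 4, 4, 5))

flat <- c(pairs)       # column-major flattening: 1 2 1 4 2 4 3 5
keep <- unique(flat)   # first occurrences only:  1 2 4 3 5

flat
keep
sort(keep)             # optional: put the indices back in row order
```

Because unique keeps first occurrences, the selected rows come back in the order the pairs were found rather than in their original row order.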

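Should each row of the matrix be compared with every other row, so that if it matches any row at the threshold (here 3), both rows are selected? To be precise, rows that have 3 elements in common should be selected.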

rowKeeper(example, 3)
     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "c"  "e"  "b"  "d" 
[2,] "a"  "c"  "b"  "e"  "d"