R: subset a data frame where the number of matching variables is k
For example, I want to subset a data frame on the condition that the number of matching variables equals a given number:
example <- rbind(sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]))
example
[,1] [,2] [,3] [,4] [,5]
[1,] "b" "a" "d" "e" "c"
[2,] "e" "c" "a" "d" "b"
[3,] "c" "a" "d" "b" "e"
[4,] "b" "d" "e" "c" "a"
[5,] "b" "c" "e" "d" "a"
Creating reproducible data for the example:
set.seed(47)
example <- rbind(sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]),
sample(letters[1:5]))
example
# [,1] [,2] [,3] [,4] [,5]
#[1,] "e" "b" "c" "d" "a"
#[2,] "d" "b" "e" "c" "a"
#[3,] "a" "c" "e" "b" "d"
#[4,] "e" "b" "a" "c" "d"
#[5,] "a" "c" "b" "e" "d"
example[... 1,]
# [,1] [,2] [,3] [,4] [,5]
#[1,] "a" "c" "e" "b" "d"
#[2,] "a" "c" "b" "e" "d"
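The subsetting expression in this answer did not survive intact, so the following is only a sketch of an all-pairs comparison of the same kind, not the answer's original code. The names `idx` and `keep` are assumptions made for illustration, and the matrix is typed in from the printed output because `sample()` can return different values for the same seed in different R versions.

```r
# Example matrix copied by hand from the printed output above.
example <- rbind(c("e", "b", "c", "d", "a"),
                 c("d", "b", "e", "c", "a"),
                 c("a", "c", "e", "b", "d"),
                 c("e", "b", "a", "c", "d"),
                 c("a", "c", "b", "e", "d"))

n <- 3  # threshold: minimum number of matching positions

idx <- seq_len(nrow(example))

# For each row i, test whether any other row j agrees with it in at
# least n columns. idx[-i] skips the comparison of a row with itself.
keep <- vapply(idx, function(i) {
  any(vapply(idx[-i],
             function(j) sum(example[i, ] == example[j, ]) >= n,
             logical(1)))
}, logical(1))

# Rows that have at least one sufficiently similar partner.
example[keep, , drop = FALSE]
```

With this data only rows 3 and 5 agree in three positions, so those two rows are returned.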
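Here we compare every row with every other row element by element, count the number of equal comparisons, and check whether that count is equal to or greater than the threshold (n). The second loop filters out the match of each row with itself.
Another approach is to use combn twice: the first call enumerates the pairs of rows, and the second performs the pairwise comparisons.
Using Ronak Shah's example data: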
combn(seq_len(nrow(example)), 2)[, combn(seq_len(nrow(example)), 2,
FUN=function(x) sum(example[x[1],] == example[x[2],]) >= 3)]
[1] 3 5
This gives the indices of the rows to keep.
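To make the two combn calls easier to follow, the one-liner can be unpacked into separate steps. This is a sketch only: the names `pairs` and `matches` are introduced here for illustration, and the matrix is typed in from the printed output rather than regenerated with set.seed.

```r
# Example matrix copied by hand from the printed output above.
example <- rbind(c("e", "b", "c", "d", "a"),
                 c("d", "b", "e", "c", "a"),
                 c("a", "c", "e", "b", "d"),
                 c("e", "b", "a", "c", "d"),
                 c("a", "c", "b", "e", "d"))

# Step 1: enumerate all pairs of row indices, one pair per column.
pairs <- combn(seq_len(nrow(example)), 2)

# Step 2: for each pair (same order as the columns of `pairs`),
# count the positions in which the two rows agree.
matches <- combn(seq_len(nrow(example)), 2,
                 FUN = function(x) sum(example[x[1], ] == example[x[2], ]))

# Step 3: keep only the pairs that reach the threshold.
pairs[, matches >= 3]
```

Because both calls enumerate the pairs in the same order, the logical vector `matches >= 3` lines up with the columns of `pairs`.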
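In general the result is a matrix with one pair per column, and the same row index can appear in more than one pair. For example, setting the threshold to 2, we get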
[,1] [,2] [,3] [,4]
[1,] 1 1 2 3
[2,] 2 4 4 5
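To turn this into something useful, use c to flatten the result into a vector, and then unique to drop the repeated row indices. While we are at it, we might as well wrap the whole process in a function that lets the threshold be chosen.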
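As a small illustration of the flattening step, the threshold-2 pairs shown earlier can be typed in by hand (the name `pairs2` is hypothetical) and passed through c and unique:

```r
# The threshold-2 result from above, one pair of row indices per column.
pairs2 <- matrix(c(1, 2,  1, 4,  2, 4,  3, 5), nrow = 2)

# c() drops the dimensions and reads the matrix column by column.
c(pairs2)          # 1 2 1 4 2 4 3 5

# unique() keeps the first occurrence of each row index.
unique(c(pairs2))  # 1 2 4 3 5
```

Note that the indices come back in order of first appearance, not sorted, so the selected rows may be reordered relative to the original matrix.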
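rowKeeper <- function(myMat, thresh) {
  # enumerate every pair of row indices (one pair per column)
  pairs <- combn(seq_len(nrow(myMat)), 2)
  # TRUE for each pair whose rows agree in at least `thresh` positions
  keep <- combn(seq_len(nrow(myMat)), 2,
                FUN=function(x) sum(myMat[x[1],] == myMat[x[2],]) >= thresh)
  # flatten the matching pairs into a vector and drop repeated row indices
  myMat[unique(c(pairs[, keep])),]
}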
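Should every row of the matrix be compared with every other row, with both rows selected whenever a pair reaches the threshold (here 3)? To be precise, the rows that have 3 elements in common should be selected.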
rowKeeper(example, 3)
[,1] [,2] [,3] [,4] [,5]
[1,] "a" "c" "e" "b" "d"
[2,] "a" "c" "b" "e" "d"
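If the function is to be reused on other matrices, a slightly more defensive variant may be worth considering. The sketch below is an assumption-laden alternative, not the answer's code: `rowKeeper2` is a hypothetical name, it assumes the input has at least two rows, it sorts the selected indices so the original row order is preserved, and it uses drop = FALSE so the result stays a matrix.

```r
# Example matrix copied by hand from the printed output above.
example <- rbind(c("e", "b", "c", "d", "a"),
                 c("d", "b", "e", "c", "a"),
                 c("a", "c", "e", "b", "d"),
                 c("e", "b", "a", "c", "d"),
                 c("a", "c", "b", "e", "d"))

rowKeeper2 <- function(myMat, thresh) {
  # All pairs of row indices, computed once (requires nrow(myMat) >= 2).
  pairs <- combn(seq_len(nrow(myMat)), 2)
  # One logical per pair: do the two rows agree in >= thresh positions?
  hit <- apply(pairs, 2,
               function(p) sum(myMat[p[1], ] == myMat[p[2], ]) >= thresh)
  # Unique row indices from the matching pairs, in original row order.
  rows <- sort(unique(c(pairs[, hit])))
  myMat[rows, , drop = FALSE]
}

rowKeeper2(example, 3)        # rows 3 and 5
nrow(rowKeeper2(example, 2))  # every row has a partner at threshold 2
nrow(rowKeeper2(example, 6))  # no pair can match in 6 positions
```

When no pair reaches the threshold, the function returns a matrix with zero rows and the original number of columns.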