R: a fast way to count, per character, the position matches between all pairs of rows (tags: r, hamming-distance, stringdist)


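I have a matrix, and for every pair of rows I want to count, for each character, how many times it appears at the same position in both rows.

Below is an example of what I did, but my matrix has 10,000 rows and it takes too long.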

# This code generates a data frame with one row per pair of rows and one column
# per letter, counting the positions at which both rows hold that letter.
my_letters <- c("A", "B", "C", "D")
size_vector <- 175  # defined but never used below; each sample has length n_vectors
n_vectors <- 10
indexes_vectors <- seq_len(n_vectors)

mtx <- sapply(indexes_vectors, 
              function(i) sample(my_letters, n_vectors, replace = TRUE))
rownames(mtx) <- indexes_vectors

df <- as.data.frame(t(combn(indexes_vectors, m = 2)))
colnames(df) <- c("index_1", "index_2")

for(l in my_letters){
  cat(l, "\n")
  df[,l] <- apply(df[,1:2], 1,
                  function(ids) {
                    sum(mtx[ids[1],] ==  mtx[ids[2],] & 
                          mtx[ids[1],] == l, na.rm = TRUE)
                  }) 

}
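To make the target quantity concrete, here is a tiny hand-checkable sketch. The two rows `row_a` and `row_b` are made up for illustration and are not taken from `mtx`; the counting rule is the same one the loop above applies to every pair.

```r
# Two hypothetical rows of equal length, written out by hand.
row_a <- c("A", "B", "C", "D", "A", "C")
row_b <- c("A", "C", "C", "D", "B", "C")
my_letters <- c("A", "B", "C", "D")

# Positions where both rows carry the same letter (here: 1, 3, 4 and 6).
same <- row_a == row_b

# For each letter, count the matching positions that hold that letter.
match_counts <- vapply(my_letters,
                       function(l) sum(same & row_a == l),
                       integer(1))
print(match_counts)
# A B C D
# 1 0 2 1
```

The sum of the four counts (4 here) is the number of agreeing positions, so the row length minus that sum is the Hamming distance between the two rows.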

I don't know if this will perform well, but it's one option:
library(data.table)
# melt() for a matrix comes from reshape2, which must be installed
matchDT = setDT(reshape2::melt(mtx))[, 
  CJ(row1 = Var1, row2 = Var1)[row1 < row2], by=.(value, col = Var2)]

dcast(matchDT, row1 + row2 ~ value)


This leaves out row combinations that have no matches. To get them back, perhaps:
levs = seq_len(nrow(mtx))
dcast(matchDT, factor(row1, levels=levs) + factor(row2, levels = levs) ~ value, drop = FALSE)[as.integer(row1) < as.integer(row2)]

Aggregate function missing, defaulting to 'length'
    row1 row2 A B C D
 1:    1    2 1 0 2 0
 2:    1    3 1 0 1 1
 3:    1    4 1 1 0 1
 4:    1    5 0 1 1 0
 5:    1    6 1 0 1 1
 6:    1    7 0 0 1 0
 7:    1    8 0 2 1 0
 8:    1    9 1 2 2 1
 9:    1   10 0 1 1 0
10:    2    3 2 0 0 0
11:    2    4 1 0 1 0
12:    2    5 0 1 1 0
13:    2    6 1 0 1 1
14:    2    7 0 0 1 0
15:    2    8 2 0 1 0
16:    2    9 1 0 1 0
17:    2   10 1 0 1 0
18:    3    4 0 0 0 2
19:    3    5 0 0 0 0
20:    3    6 1 0 0 2
21:    3    7 1 1 1 0
22:    3    8 1 0 0 1
23:    3    9 1 1 0 0
24:    3   10 1 0 1 0
25:    4    5 0 2 1 0
26:    4    6 0 1 0 2
27:    4    7 0 0 0 0
28:    4    8 1 1 0 2
29:    4    9 0 2 0 0
30:    4   10 0 2 1 0
31:    5    6 0 1 1 0
32:    5    7 0 2 1 0
33:    5    8 0 1 0 1
34:    5    9 0 1 1 0
35:    5   10 0 2 1 1
36:    6    7 0 1 2 1
37:    6    8 0 0 0 1
38:    6    9 1 1 1 0
39:    6   10 0 1 0 0
40:    7    8 0 0 1 0
41:    7    9 0 0 1 0
42:    7   10 0 1 2 0
43:    8    9 1 2 1 0
44:    8   10 1 1 1 1
45:    9   10 0 2 1 0
    row1 row2 A B C D

One possible solution uses base R:
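# One integer vector per row of df
l1 <- lapply(split(df, 1:nrow(df)), as.integer)

l2 <- lapply(l1, function(x) {
  m <- mtx[x[1],] == mtx[x[2],]              # positions where the two rows agree
  l <- lapply(my_letters, '==', mtx[x[1],])  # per letter: positions holding it in the first row
  sapply(l, function(i) sum(i & m))          # matches per letter
})

# Bind the per-pair counts to df
cbind(df, setNames(do.call(rbind.data.frame, l2), my_letters))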

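I may not fully understand what is going on here. Your output df contains the columns index_1 and index_2 plus the four letters. So in the first row, index_1 = 1 and index_2 = 2. Do you then want to know how many times each letter appears at mtx[2,1] and mtx[1,2]? Each index pair only allows two possible letters there, yet your output df often shows more. You are also dropping all the [x,x] positions, though I don't know whether that is intentional.

For each letter ("A", "B", "C" and "D"), I want to know how many times it appears at the same position in mtx[1,] and mtx[2,], not at mtx[2,1] and mtx[1,2]. I dropped the [x,x] pairs on purpose.

I simplified your code to use combn alone and changed mine to use combn instead of CJ; it performs better when run with different input parameters (letters, etc.):

Thank you very much. I made the change and your code runs faster.

I think I have posted or seen this question before. Happy to delete it if someone finds the dupe.

Thanks, but measuring speed with microbenchmark, your solution seems to be 9 times slower than mine (14 times if I include the rows with no matches), although it is undoubtedly more elegant. Here is a link to the code I used to test:

@celacanto Thanks, I was curious about performance. By the way, you only need one of those three dcast calls to get the result. (I tested after keeping only dcast(matchDT, factor(row1, levels=levs) + factor(row2, levels=levs) ~ value, drop = FALSE)[as.integer(row1) ... and it is still 9 times slower than yours.)

Thanks, but measuring speed with microbenchmark, your solution seems to be slower than mine. Here is the code I used to test:

@celacanto Benchmarking on such a small dataset is of no use; a solution that is slow on a small dataset can be fast on a large one; if you use n_vectors ...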
   index_1 index_2 A B C D
1        1       2 0 0 0 0
2        1       3 0 0 2 1
3        1       4 0 0 0 1
4        1       5 0 1 2 0
5        1       6 0 0 3 1
6        1       7 0 1 1 3
7        1       8 0 1 2 2
8        1       9 0 0 2 1
9        1      10 0 0 2 0
10       2       3 0 1 0 1
11       2       4 0 1 0 2
12       2       5 0 1 0 0
13       2       6 0 0 0 2
14       2       7 0 1 0 1
15       2       8 1 0 0 0
16       2       9 0 1 0 2
17       2      10 2 1 0 3
18       3       4 0 0 1 0
19       3       5 0 0 1 1
20       3       6 0 0 1 1
21       3       7 0 1 1 2
22       3       8 0 0 0 1
23       3       9 1 0 0 0
24       3      10 0 0 0 1
25       4       5 0 2 1 0
26       4       6 0 0 1 1
27       4       7 1 1 0 1
28       4       8 1 1 1 1
29       4       9 0 1 1 2
30       4      10 0 1 0 2
31       5       6 0 1 2 0
32       5       7 0 1 1 0
33       5       8 0 2 1 0
34       5       9 0 1 2 0
35       5      10 0 2 1 0
36       6       7 1 0 1 1
37       6       8 0 0 3 1
38       6       9 0 1 2 0
39       6      10 0 0 1 1
40       7       8 0 1 0 2
41       7       9 0 1 0 1
42       7      10 0 0 0 1
43       8       9 0 0 2 1
44       8      10 1 1 1 0
45       9      10 0 0 2 1

m1 <- t(sapply(1:nrow(df), function(i) 
  table(factor(mtx[df[i,1],][mtx[df[i,1],] == mtx[df[i,2],]],
               levels = my_letters))))
cbind(df, m1)

>  V1 V2 A B C D
1   1  2 0 0 1 1
2   1  3 1 0 1 1
3   1  4 1 0 2 1
4   1  5 0 0 1 0
5   1  6 2 0 2 0
6   1  7 0 0 1 0
7   1  8 1 0 1 1
8   1  9 0 0 1 0
9   1 10 1 0 1 1
10  2  3 0 0 1 1
11  2  4 1 1 1 2
12  2  5 0 0 0 1
13  2  6 1 0 2 1
14  2  7 1 0 0 1
15  2  8 1 0 0 0
16  2  9 2 0 0 0
17  2 10 1 0 1 0
18  3  4 0 0 0 0
19  3  5 0 2 1 0
20  3  6 1 1 2 1
21  3  7 0 1 0 0
22  3  8 1 1 0 0
23  3  9 0 1 2 0
24  3 10 0 0 1 0
25  4  5 1 1 0 1
26  4  6 2 1 1 0
27  4  7 1 0 1 1
28  4  8 0 1 0 0
29  4  9 1 0 0 0
30  4 10 2 0 0 0
31  5  6 0 2 0 0
32  5  7 0 1 3 1
33  5  8 0 1 2 0
34  5  9 1 0 2 0
35  5 10 0 0 2 0
36  6  7 0 0 0 0
37  6  8 1 1 0 0
38  6  9 0 0 1 0
39  6 10 3 0 1 0
40  7  8 0 1 1 0
41  7  9 1 0 1 0
42  7 10 0 0 1 0
43  8  9 1 1 1 1
44  8 10 0 0 1 0
45  9 10 0 0 0 0
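
Not something proposed in the thread, but since the discussion turned to speed on larger inputs, here is a sketch of a fully vectorised base R alternative. It builds one 0/1 indicator matrix per letter and lets `tcrossprod()` count the shared positions for all pairs at once. The sizes `n_rows` and `n_cols`, the seed, and the helper `count_letter` are assumptions made up for this sketch.

```r
set.seed(1)    # arbitrary seed, only so the sketch is reproducible
my_letters <- c("A", "B", "C", "D")
n_rows <- 50   # hypothetical sizes, not the asker's real data
n_cols <- 20
mtx <- matrix(sample(my_letters, n_rows * n_cols, replace = TRUE),
              nrow = n_rows)

# All pairs (i < j), in the same order combn() produces them.
pairs <- t(combn(n_rows, 2))

# For one letter: the indicator matrix times its transpose holds, in cell
# [i, j], the number of columns where row i and row j both carry that letter.
count_letter <- function(l) {
  ind <- (mtx == l) * 1L     # logical -> 0/1
  tcrossprod(ind)[pairs]     # pick out the (i, j) cells with i < j
}

counts <- vapply(my_letters, count_letter, numeric(nrow(pairs)))
res <- data.frame(index_1 = pairs[, 1], index_2 = pairs[, 2], counts)

# Sanity check against a direct pair-by-pair count.
brute <- t(apply(pairs, 1, function(p) {
  same <- mtx[p[1], ] == mtx[p[2], ]
  vapply(my_letters, function(l) sum(same & mtx[p[1], ] == l), integer(1))
}))
stopifnot(all(brute == counts))

head(res)
```

One caveat on the design: `tcrossprod()` materialises a full n x n matrix per letter, so with 10,000 rows that is roughly 800 MB of doubles at a time, and the pair index itself has about 50 million rows. Whether that fits depends on the machine; processing the rows in blocks would be one way around it. Wrapping the `vapply()` call in `system.time()` at a few values of `n_rows` is a simple way to see how it scales compared with the per-pair loops.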