R: How to group nearly identical words

I have a DNA sequence split into words of 8 letters each, for example "AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCTG", and so on, about 50,000 words in total. I want to group the words so that all words sharing 6 of their 8 letters end up together.

Put differently, I need all words within 2 substitutions of each other in one cluster, and words that differ by more than 2 substitutions in another cluster. For example, suppose "AAAAAAAG" could belong to either "AAAAAAAA" or "AAAAAAAC"; it should be assigned to one of them, but not to both. I hope this is clear; please comment if you need any further clarification. Thanks, everyone.

> words <- sample[1:25]
> group <- lapply(words, function(x)list(x,words[agrep(x, words,max.distance=list(all=2, insertions=0, deletions=0, substitutions=2))]))
> group
[[1]]
[[1]][[1]]
[1] "AAAAAAAA"

[[1]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCA" "AAAAACGA"


[[2]]
[[2]][[1]]
[1] "AAAAAAAC"

[[2]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCC"


[[3]]
[[3]][[1]]
[1] "AAAAAAAG"

[[3]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCG"

How can I reduce the redundancy in my output?

Using your adist call, you can do the following:

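words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
# For each word, keep every word at edit distance 2 or less (adist() < 3)
lapply(words, function(x) list(match.word = x, six.letter.grp = words[adist(x, words) < 3]))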
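This outputs the following list, which shows each word you are matching together with all the words it matches, including the word itself. You can adjust the output as needed: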

[[1]]
[[1]]$match.word
[1] "AAAAAAAA"

[[1]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"


[[2]]
[[2]]$match.word
[1] "TTTTTTTT"

[[2]]$six.letter.grp
[1] "TTTTTTTT"


[[3]]
[[3]]$match.word
[1] "AAAAAAGC"

[[3]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC"


[[4]]
[[4]]$match.word
[1] "AAAACCAA"

[[4]]$six.letter.grp
[1] "AAAAAAAA" "AAAACCAA"
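
One caveat: adist computes a generalized Levenshtein distance, so insertions and deletions count as well as substitutions. For example, "ACGTACGT" and "CGTACGTA" differ at every position, yet are only two edits apart (delete the leading A, append an A). For equal-length words, a substitution-only (Hamming) count may be closer to what the question asks for. The following is a minimal sketch under that assumption; hamming() is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: number of positions at which two equal-length strings differ.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")

# Words within two substitutions of each word (the word itself included)
near <- lapply(words, function(x) {
  words[vapply(words, function(y) hamming(x, y), integer(1)) <= 2]
})
names(near) <- words
near
```

On these four sample words the groups are the same as with adist; the two measures only diverge when words are shifted relative to each other.
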
For a more compact list structure, you could try:

d <- lapply(words, function(x) words[agrep(x, words,
         max.distance=list(all=2, insertions=0, deletions=0, substitutions=2))])
names(d) <- words
d
#$AAAAAAAA
#[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"
#
#$TTTTTTTT
#[1] "TTTTTTTT"
# 
#$AAAAAAGC
#[1] "AAAAAAAA" "AAAAAAGC"
#
#$AAAACCAA
#[1] "AAAAAAAA" "AAAACCAA"
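
The list above still repeats words, because each word is matched independently of the others. If every word must land in exactly one group, one possible approach is a greedy pass: take the first unassigned word as a group seed, put every still-unassigned word within two substitutions of it into that group, and continue with the next unassigned word. This is only a sketch under assumptions the question does not confirm: ties go to whichever seed comes first, all words have equal length, and the helper names hamming_to() and greedy_groups() are hypothetical.

```r
# Hypothetical helper: substitutions between one seed and each word.
# Assumes every word has the same length as the seed.
hamming_to <- function(seed, words) {
  s <- strsplit(seed, "")[[1]]
  vapply(strsplit(words, ""), function(w) sum(w != s), integer(1))
}

# Greedy, order-dependent grouping: each word joins the first seed
# that lies within max_sub substitutions of it.
greedy_groups <- function(words, max_sub = 2) {
  words <- unique(words)
  assigned <- rep(FALSE, length(words))
  groups <- list()
  for (i in seq_along(words)) {
    if (assigned[i]) next
    hit <- !assigned & hamming_to(words[i], words) <= max_sub
    groups[[words[i]]] <- words[hit]
    assigned[hit] <- TRUE
  }
  groups
}

words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
greedy_groups(words)
```

Each word now appears once, but the result depends on the order of the input, and two members of the same group can be up to four substitutions apart because each is only compared with the seed.
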

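What have you tried so far?

I used the adist function and the matchPattern function, but nothing worked.

Please show those efforts in your post.

Similar to your post! How can I avoid the redundancy in the output?

@VikasBanka You can do a lot of things, depending on what exactly you are looking for. You have not provided a minimal reproducible example or given any desired output, which you should always do when asking a question on Stack Overflow. Which redundancy do you want to avoid? Can you add your desired output to the question for a small sample dataset, such as the words data I used?

Hey, I just edited my question. You can see part of my output there, and the redundancy in it.

@VikasBanka I see the output you got using my answer, but I don't see your desired output. You need to define what should happen to a word that matches more than one other word. Consider this: "AAAAACCA" is within two substitutions of both "AAAAAAAA" and "AAAACCCA", which are 3 substitutions apart from each other. Which group should "AAAAACCA" belong to? Does that make sense? The issue is that you need to define your problem and the desired output more explicitly, so that people know exactly what you want.

The basic idea is that I need all words within 2 substitutions in one cluster, and words with more than 2 substitutions in another cluster. For example, suppose "AAAAAAAG" could belong to either "AAAAAAAA" or "AAAAAAAC"; it should be assigned to one of them, but not to both. I hope this is clear; please comment if you need any further clarification. Thanks a lot.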