R: take the text of each row in one table, match it against another table, and create a new table containing all matches
Let me explain the problem with an example. I have two data frames:
df1 <- data.frame(Gene=c(1,2,3,4,5,6,7,8),
Description=c("ribonuclease HII", "Leucyl-tRNA synthetase", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "ribonuclease HII", "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
Species=c("aa", "bb","aa","cc","ee","ff","aa","dd"),
Number1= c(1,0,3,20,99,100,31,123),
Number2 =c(1000, 12636,12,455,231,454,123,1), stringsAsFactors = FALSE)
> df1
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                            Leucyl-tRNA synthetase      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
5    5                          PAS domain S-box protein      ee      99     231
6    6                                  ribonuclease HII      ff     100     454
7    7                         Isoleucyl-tRNA synthetase      aa      31     123
8    8                               Succinyl-CoA ligase      dd     123       1
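I tried using the grep function inside a for loop, without success. Here is the tricky part: a search for "Arginyl-tRNA synthetase (EC 6.1.1.19)" should also capture "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195". However, a search for "Leucyl-tRNA synthetase" should not pick up "Isoleucyl-tRNA synthetase", which contains the same wording as "Leucyl-tRNA synthetase".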
Thanks in advance. I am also open to suggestions on the post title and edits.
You can use the left_join function from dplyr:
library(dplyr)
df1 <- data.frame(Gene=c(1,2,3,4,5,6,7,8),
Description=c("ribonuclease HII", "Leucyl-tRNA synthetase", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "ribonuclease HII", "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
Species=c("aa", "bb","aa","cc","ee","ff","aa","dd"),
Number1= c(1,0,3,20,99,100,31,123),
Number2 =c(1000, 12636,12,455,231,454,123,1), stringsAsFactors = FALSE)
df2 <- data.frame(Description=c("ribonuclease HII", "Leucyl-tRNA synthetase","Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)"), stringsAsFactors = FALSE)
left_join(df2, df1, by = "Description") %>% select(Gene, everything())
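Good luck.
You can use grepl() to get the desired result. First, build a search pattern by joining the descriptions in df2 with "|" (the pattern has no word boundaries ("\\b"), so each description is matched as a plain substring). Then use gsub() to escape all metacharacters except "|", and finally use grepl() to subset the rows of df1: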
new_pat <-paste0(df2$Description, collapse = "|")
new_pat <- gsub("([][{}().+*^$\\?])", "\\\\\\1", new_pat)
df1[grepl(new_pat, df1$Description), ]
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                            Leucyl-tRNA synthetase      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
6    6                                  ribonuclease HII      ff     100     454
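If you also want the new table to record which search term produced each match, one possible variation is to loop over df2$Description and call grepl() with fixed = TRUE, which treats each description as a literal string so no escaping is needed. This is only a sketch under that assumption, not the method from the answer above; the helper name match_one and the column name Query are made up for the illustration.

```r
# Same example data as in the question and the first answer.
df1 <- data.frame(
  Gene = c(1, 2, 3, 4, 5, 6, 7, 8),
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)",
                  "PAS domain S-box protein", "ribonuclease HII",
                  "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
  Species = c("aa", "bb", "aa", "cc", "ee", "ff", "aa", "dd"),
  Number1 = c(1, 0, 3, 20, 99, 100, 31, 123),
  Number2 = c(1000, 12636, 12, 455, 231, 454, 123, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows of `data` whose Description contains `query`
# as a literal (non-regex), case-sensitive substring.
match_one <- function(query, data) {
  hits <- data[grepl(query, data$Description, fixed = TRUE), , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)  # nothing matched this query
  data.frame(Query = query, hits, stringsAsFactors = FALSE)
}

# One block of rows per search term, stacked into a single table.
matches <- do.call(rbind, lapply(df2$Description, match_one, data = df1))
rownames(matches) <- NULL
matches
```

Because the match is case-sensitive, "Leucyl-tRNA synthetase" does not hit "Isoleucyl-tRNA synthetase" (lowercase "l"). A lowercase query, or ignore.case = TRUE in a regex version, would, so that case would need an anchor or word boundary.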
For reference, the left_join() call from the dplyr answer returns:
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    6                                  ribonuclease HII      ff     100     454
3    2                            Leucyl-tRNA synthetase      bb       0   12636
4    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
5    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
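Note that a join compares Description values for exact equality, so the suffixed "17855:19195" entry is returned only because df2 lists it explicitly. The sketch below illustrates the difference with a hypothetical df2_short that omits that entry. It uses base R's merge() with all.x = TRUE as a stand-in for left_join() and trims df1 to two columns for brevity; both choices are assumptions made for the illustration.

```r
# df1 trimmed to the two columns that matter here (assumption for brevity).
df1 <- data.frame(
  Gene = c(1, 2, 3, 4, 5, 6, 7, 8),
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)",
                  "PAS domain S-box protein", "ribonuclease HII",
                  "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table without the suffixed Arginyl entry.
df2_short <- data.frame(
  Description = c("Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)"),
  stringsAsFactors = FALSE
)

# Exact-equality join (base R analogue of a left join): Gene 3 is missed.
exact <- merge(df2_short, df1, by = "Description", all.x = TRUE)

# Prefix match: keep rows whose Description starts with any search term.
is_prefix_hit <- vapply(
  df1$Description,
  function(d) any(startsWith(d, df2_short$Description)),
  logical(1),
  USE.NAMES = FALSE
)
prefix <- df1[is_prefix_hit, ]

sort(exact$Gene)  # genes found by the exact join
prefix$Gene       # genes found by the prefix match
```

With this data the exact join finds genes 2 and 4, while the prefix match also picks up gene 3 and still leaves out "Isoleucyl-tRNA synthetase".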