R: take the text in each row of a table, match it against another table, and create a new table with all the matches


Let me explain the problem with an example. I have two data frames:

  df1 <- data.frame(Gene=c(1,2,3,4,5,6,7,8),
                 Description=c("ribonuclease HII", "Leucyl-tRNA synthetase", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "ribonuclease HII", "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
                 Species=c("aa", "bb","aa","cc","ee","ff","aa","dd"),
                 Number1= c(1,0,3,20,99,100,31,123),
                 Number2 =c(1000, 12636,12,455,231,454,123,1), stringsAsFactors = FALSE)

   > df1
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                            Leucyl-tRNA synthetase      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
5    5                          PAS domain S-box protein      ee      99     231
6    6                                  ribonuclease HII      ff     100     454
7    7                         Isoleucyl-tRNA synthetase      aa      31     123
8    8                               Succinyl-CoA ligase      dd     123       1
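I tried using the grep function inside a for loop, without success. There is a tricky part: a search for "Arginyl-tRNA synthetase (EC 6.1.1.19)" should also capture "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195". However, a search for "Leucyl-tRNA synthetase" should not pick up "Isoleucyl-tRNA synthetase", which contains the same wording as "Leucyl-tRNA synthetase".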


Thanks in advance. I am also open to suggestions on the post title and to edits.

You can use the left_join function from dplyr:

library(dplyr)
df1 <- data.frame(Gene=c(1,2,3,4,5,6,7,8),
                  Description=c("ribonuclease HII", "Leucyl-tRNA synthetase", "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)", "PAS domain S-box protein", "ribonuclease HII", "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
                  Species=c("aa", "bb","aa","cc","ee","ff","aa","dd"),
                  Number1= c(1,0,3,20,99,100,31,123),
                  Number2 =c(1000, 12636,12,455,231,454,123,1), stringsAsFactors = FALSE)

df2 <- data.frame(Description=c("ribonuclease HII", "Leucyl-tRNA synthetase","Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195", "Arginyl-tRNA synthetase (EC 6.1.1.19)"), stringsAsFactors = FALSE)

left_join(df2, df1, by = "Description") %>% select(Gene, everything())
Good luck!
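If dplyr is not available, the same exact-match idea can be sketched in base R with `%in%`. This is only an illustration under a few assumptions: the data frames are trimmed copies of the ones in the question, and `exact_rows` is a name made up for this example. Unlike left_join, this filter keeps the row order of df1 and drops df2 entries that have no match instead of returning them with NA values.

```r
# Trimmed copies of the question's data (only the columns needed here).
df1 <- data.frame(
  Gene = 1:8,
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)",
                  "PAS domain S-box protein", "ribonuclease HII",
                  "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)"),
  stringsAsFactors = FALSE
)

# Keep the rows of df1 whose Description is exactly equal to an entry in df2.
# Whole strings are compared, so no partial matching happens here.
exact_rows <- df1[df1$Description %in% df2$Description, ]
print(exact_rows)
```

Because whole strings are compared, the longer Arginyl entry is found only because df2 lists it explicitly.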

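You can use grepl() to get the desired result. First, build a search pattern by collapsing the descriptions in df2 with "|". The pattern has no word boundaries ("\\b"), so it matches each description wherever it occurs in a string. Then use gsub() to escape every regex metacharacter except "|", and use grepl() to subset the rows of df1: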

new_pat <-paste0(df2$Description, collapse = "|")
new_pat <- gsub("([][{}().+*^$\\?])", "\\\\\\1", new_pat)

df1[grepl(new_pat, df1$Description), ]
  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    2                            Leucyl-tRNA synthetase      bb       0   12636
3    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
4    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455
6    6                                  ribonuclease HII      ff     100     454
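A possible alternative, sketched here as an assumption and not taken from the answer above, is to test each description for a literal prefix with base R's startsWith(). It compares plain strings, so no regex escaping is needed, and a term such as "Leucyl-tRNA synthetase" cannot match in the middle of "Isoleucyl-tRNA synthetase". The regex version skips that row only because matching is case-sensitive. The names `terms`, `has_prefix`, and `prefix_matches` are hypothetical, and `terms` deliberately lists only the short Arginyl form.

```r
# Trimmed copy of the question's df1 (only the columns needed here).
df1 <- data.frame(
  Gene = 1:8,
  Description = c("ribonuclease HII", "Leucyl-tRNA synthetase",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195",
                  "Arginyl-tRNA synthetase (EC 6.1.1.19)",
                  "PAS domain S-box protein", "ribonuclease HII",
                  "Isoleucyl-tRNA synthetase", "Succinyl-CoA ligase"),
  stringsAsFactors = FALSE
)

# Hypothetical search terms: only the short Arginyl form is listed.
terms <- c("ribonuclease HII", "Leucyl-tRNA synthetase",
           "Arginyl-tRNA synthetase (EC 6.1.1.19)")

# TRUE for each element of x that starts with at least one of the prefixes.
has_prefix <- function(x, prefixes) {
  Reduce(`|`,
         lapply(prefixes, function(p) startsWith(x, p)),
         init = rep(FALSE, length(x)))
}

prefix_matches <- df1[has_prefix(df1$Description, terms), ]
print(prefix_matches)
```

With this sample data the filter keeps genes 1, 2, 3, 4 and 6: the longer Arginyl entry is captured through its prefix, and the Isoleucyl row is left out.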

For reference, the left_join() call in the dplyr answer returns:

  Gene                                       Description Species Number1 Number2
1    1                                  ribonuclease HII      aa       1    1000
2    6                                  ribonuclease HII      ff     100     454
3    2                            Leucyl-tRNA synthetase      bb       0   12636
4    3 Arginyl-tRNA synthetase (EC 6.1.1.19) 17855:19195      aa       3      12
5    4             Arginyl-tRNA synthetase (EC 6.1.1.19)      cc      20     455