R: How do I remove rows from one table whose values appear in another table?

Tags: r, rsqlite

I have data like this:

> head(dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes"'))
      gene_id sample_1 sample_2 status value_1 value_2 log2_fold_change test_stat p_value   q_value significant
1 XLOC_000219       M4       M3     OK 3.85465 0.00000             -Inf        NA   5e-05 0.0075951         yes
2 XLOC_004272       M4       M3     OK 2.06687 0.00000             -Inf        NA   5e-05 0.0075951         yes
3 XLOC_004991       M4       M3     OK 3.29904 0.00000             -Inf        NA   5e-05 0.0075951         yes
4 XLOC_007234       M4       M3     OK 1.28027 0.00000             -Inf        NA   5e-05 0.0075951         yes
5 XLOC_000664       M4       F4     OK 1.46853 0.00000             -Inf        NA   5e-05 0.0075951         yes
6 XLOC_001809       M4       F4     OK 0.00000 1.91743              Inf        NA   5e-05 0.0075951         yes

I used RSQLite to generate two subsets:

M4M3 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "M3"')

M4F4 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "F4"')
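
I want to remove from M4M3 every row that has a matching gene_id in M4F4. It does not matter whether I use RSQLite to filter the dataset; a pure R solution would be fine too. I am just not sure how to compare the two tables and drop from one the rows that also occur in the other.

Thanks for any suggestions.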

There are many ways to do this.

Base R subsetting solution, as mentioned by Balter above:

M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id),]
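
Base R set-difference solution (applied to the gene_id column, because the rows of the two subsets always differ in sample_2 and so never match as whole rows):

M4M3.new <- M4M3[M4M3$gene_id %in% setdiff(M4M3$gene_id, M4F4$gene_id), ]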

dplyr solution:

M4M3.new <- dplyr::anti_join(M4M3, 
                             M4F4, 
                             by = c("gene_id" = "gene_id"))
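
Edit: all of these ran as expected on the following test dataset. Note that gene_id values 1 to 6 fall in the M3 subset and 7 to 10 in the F4 subset, so no gene_id is shared and no rows are actually removed here:

tst1 <- data.frame(gene_id = 1:10,
                   sample_1 = rep("M4", 10),
                   sample_2 = c(rep("M3", 6), rep("F4", 4)),
                   other_values = sample(1:10, 10, replace = TRUE),
                   other_values2 = rep("OK", 10))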

M4M3 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "M3",]
M4F4 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2  == "F4",]
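
To see rows actually being dropped, here is a minimal base R sketch on made-up data in which two gene_id values occur in both subsets. The gene names g1 to g5 and the object names keep_ids and M4M3.alt are assumptions for illustration only; they are not from the original data.

```r
# Hypothetical toy data: genes g2 and g4 occur in both comparisons.
M4M3 <- data.frame(gene_id = c("g1", "g2", "g3", "g4"),
                   sample_1 = "M4", sample_2 = "M3",
                   stringsAsFactors = FALSE)
M4F4 <- data.frame(gene_id = c("g2", "g4", "g5"),
                   sample_1 = "M4", sample_2 = "F4",
                   stringsAsFactors = FALSE)

# Keep only the M4M3 rows whose gene_id does not occur in M4F4.
M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id), ]

# The same rows, selected via a set difference on the key column alone.
keep_ids <- setdiff(M4M3$gene_id, M4F4$gene_id)
M4M3.alt <- M4M3[M4M3$gene_id %in% keep_ids, ]

print(M4M3.new$gene_id)  # "g1" "g3"
stopifnot(identical(M4M3.new, M4M3.alt))
```

Both forms compare only the key column, which is what the question asks for; any comparison of whole rows would be affected by the differing sample_2 values.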

If you want the join to run inside the database, you can also do it through dbplyr:

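library(dplyr)  # dbplyr must also be installed for database backends

# Lazy reference to the table; mydb is the DBI connection from the question
geneExpDiffData <- tbl(mydb, "geneExpDiffData")

M4M3 <- geneExpDiffData %>%
  filter(significant == "yes", sample_1 == "M4", sample_2 == "M3")

M4F4 <- geneExpDiffData %>%
  filter(significant == "yes", sample_1 == "M4", sample_2 == "F4")

# Join on gene_id only: joining on all shared columns would never match,
# because sample_2 differs between the two subsets
anti_join(M4M3, M4F4, by = "gene_id")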

Learn more in the dbplyr documentation.

You can do this directly in a single SQL statement, like so:

M4M3 <- dbGetQuery(mydb, "
SELECT *
FROM geneExpDiffData
WHERE significant = 'yes'
AND sample_1 = 'M4'
AND sample_2 = 'M3'
AND gene_id NOT IN (SELECT gene_id
                    FROM geneExpDiffData
                    WHERE significant = 'yes'
                    AND sample_1 = 'M4'
                    AND sample_2 = 'F4')
")
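
The subquery in the inner parentheses returns the gene_id of every row in M4F4.
So the outer query keeps the rows whose gene_id is in the first subset but not in the second.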
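
The same filter can be tried end to end on a throwaway in-memory database. The sketch below is only an illustration under stated assumptions: the four-row table is made up, the connection name con and the parameter names s1, s2_keep and s2_drop are hypothetical, and NOT EXISTS is swapped in for NOT IN as an equivalent way to write the anti-join. It assumes the DBI and RSQLite packages are installed.

```r
library(DBI)

# In-memory SQLite database holding a hypothetical, cut-down geneExpDiffData.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "geneExpDiffData", data.frame(
  gene_id     = c("g1", "g2", "g2", "g3"),
  sample_1    = "M4",
  sample_2    = c("M3", "M3", "F4", "F4"),
  significant = "yes",
  stringsAsFactors = FALSE
))

# Anti-join written with NOT EXISTS; sample names are bound as named parameters.
res <- dbGetQuery(con, "
  SELECT a.*
  FROM geneExpDiffData AS a
  WHERE a.significant = 'yes'
    AND a.sample_1 = :s1
    AND a.sample_2 = :s2_keep
    AND NOT EXISTS (SELECT 1
                    FROM geneExpDiffData AS b
                    WHERE b.significant = 'yes'
                      AND b.sample_1 = :s1
                      AND b.sample_2 = :s2_drop
                      AND b.gene_id = a.gene_id)
", params = list(s1 = "M4", s2_keep = "M3", s2_drop = "F4"))

print(res$gene_id)  # "g1"
dbDisconnect(con)
```

Binding the sample names as parameters makes it easy to reuse the statement for other pairs of comparisons without pasting values into the SQL string.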

Comments:

I'm not sure I understand what you are asking when you refer to M4M3 and M4F4. Are you talking about the sample_1 and sample_2 values together? That is, selecting rows where sample_1 = 'M4' and sample_2 = 'M3' and gene_id = 'M4F4'? I can't get to M4M3.new starting from the sample dataset you provided.

If you look at the structure of the data table, I want all rows with sample_1 = M4 and sample_2 = M3 that have no matching gene_id in the comparison sample_1 = M4 and sample_2 = F4.

For your anti-join solution, you can just set by = "gene_id".

You are correct, DMC. I was being as explicit as possible, but it really isn't necessary here.