R: How do I remove rows from one table that have matching values in another table?
I have data like the following:
> head(dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes"'))
gene_id sample_1 sample_2 status value_1 value_2 log2_fold_change test_stat p_value q_value significant
1 XLOC_000219 M4 M3 OK 3.85465 0.00000 -Inf NA 5e-05 0.0075951 yes
2 XLOC_004272 M4 M3 OK 2.06687 0.00000 -Inf NA 5e-05 0.0075951 yes
3 XLOC_004991 M4 M3 OK 3.29904 0.00000 -Inf NA 5e-05 0.0075951 yes
4 XLOC_007234 M4 M3 OK 1.28027 0.00000 -Inf NA 5e-05 0.0075951 yes
5 XLOC_000664 M4 F4 OK 1.46853 0.00000 -Inf NA 5e-05 0.0075951 yes
6 XLOC_001809 M4 F4 OK 0.00000 1.91743 Inf NA 5e-05 0.0075951 yes
I generated two subsets with RSQLite:
M4M3 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "M3"')
M4F4 <- dbGetQuery(mydb, 'SELECT * FROM geneExpDiffData WHERE significant = "yes" AND sample_1 = "M4" AND sample_2 = "F4"')
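I want to remove from M4M3 every row that has a matching gene_id in M4F4. It doesn't matter that I used RSQLite to filter the dataset; a pure R solution would be fine too. I'm just not sure how to compare the two tables and remove from one the rows that appear in the other.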
Thanks for any suggestions.
There are many ways to do this. The base R subsetting solution, as mentioned by Balter above:
M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id),]
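Set-difference solution (note that a row-wise set difference compares entire rows, not just gene_id, so it only drops rows that are identical in every column; the row-wise data frame method is the one provided by dplyr::setdiff, whereas base setdiff is designed for vectors):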
M4M3.new <- setdiff(M4M3, M4F4)
dplyr solution:
M4M3.new <- dplyr::anti_join(M4M3,
M4F4,
by = c("gene_id" = "gene_id"))
Edit: all of these appear to work when tested on the following dataset:
tst1 <- data.frame(gene_id = 1:10,
                   sample_1 = rep("M4", 10),
                   sample_2 = c(rep("M3", 6), rep("F4", 4)),
                   other_values = sample(1:10, 10, replace = TRUE),
                   other_values2 = rep("OK", 10))
M4M3 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2 == "M3",]
M4F4 <- tst1[tst1$sample_1 == "M4" & tst1$sample_2 == "F4",]
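In the test dataset above the two subsets share no gene_id, so every approach returns M4M3 unchanged. The following is a minimal sketch with hypothetical toy data (the gene names g1 to g4 are made up for illustration) in which one gene appears in both subsets. It shows the `%in%` subsetting approach removing that gene, and it also shows that matching on all shared columns finds no overlap because sample_2 differs between the two subsets.

```r
# Hypothetical toy data: gene "g2" is significant in both comparisons
M4M3 <- data.frame(gene_id = c("g1", "g2", "g3"),
                   sample_2 = "M3",
                   stringsAsFactors = FALSE)
M4F4 <- data.frame(gene_id = c("g2", "g4"),
                   sample_2 = "F4",
                   stringsAsFactors = FALSE)

# Keep only the rows of M4M3 whose gene_id does not occur in M4F4
M4M3.new <- M4M3[!(M4M3$gene_id %in% M4F4$gene_id), ]
print(M4M3.new$gene_id)

# Matching on every shared column (gene_id AND sample_2) finds nothing,
# because the sample_2 values never agree between the two subsets
whole_row_matches <- merge(M4M3, M4F4)
print(nrow(whole_row_matches))
```

This is why the comparison has to be restricted to gene_id, either through `%in%` as here or through the `by` argument of an anti-join.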
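If you want the join to run inside the database, you can also do it through dbplyr:
library(dplyr)
# mydb is the DBI connection used in the question
src <- dbplyr::src_dbi(mydb)
geneExpDiffData <- tbl(src, "geneExpDiffData")
M4M3 <- geneExpDiffData %>%
  filter(significant == "yes" & sample_1 == "M4" & sample_2 == "M3")
M4F4 <- geneExpDiffData %>%
  filter(significant == "yes" & sample_1 == "M4" & sample_2 == "F4")
# Join on gene_id only; without `by`, all shared columns (including sample_2) would be compared
anti_join(M4M3, M4F4, by = "gene_id")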
You can do this directly in a single SQL statement, as follows:
M4M3 <- dbGetQuery(mydb, '
SELECT *
FROM geneExpDiffData
WHERE significant = "yes"
AND sample_1 = "M4"
AND sample_2 = "M3"
AND gene_id not in (SELECT gene_id
FROM geneExpDiffData
WHERE significant = "yes"
AND sample_1 = "M4"
AND sample_2 = "F4")
')
The code inside the inner parentheses returns a table of all gene_id values in M4F4.
So we want every gene_id that is in the first table but not in the second.
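To see the subquery at work, here is a minimal sketch that runs the same query shape against an in-memory SQLite database. The miniature geneExpDiffData table and the gene names g1 to g4 are hypothetical, and the sketch assumes the DBI and RSQLite packages are installed.

```r
library(DBI)

# In-memory SQLite database holding a hypothetical miniature geneExpDiffData
con <- dbConnect(RSQLite::SQLite(), ":memory:")
toy <- data.frame(
  gene_id     = c("g1", "g2", "g3", "g2", "g4"),
  sample_1    = "M4",
  sample_2    = c("M3", "M3", "M3", "F4", "F4"),
  significant = "yes",
  stringsAsFactors = FALSE
)
dbWriteTable(con, "geneExpDiffData", toy)

# The subquery lists the gene_ids significant in M4 vs F4;
# NOT IN removes those genes from the M4 vs M3 result
res <- dbGetQuery(con, "
  SELECT *
  FROM geneExpDiffData
  WHERE significant = 'yes'
    AND sample_1 = 'M4'
    AND sample_2 = 'M3'
    AND gene_id NOT IN (SELECT gene_id
                        FROM geneExpDiffData
                        WHERE significant = 'yes'
                          AND sample_1 = 'M4'
                          AND sample_2 = 'F4')
  ORDER BY gene_id
")
print(res$gene_id)
dbDisconnect(con)
```

The sketch writes string literals with single quotes, which is the standard SQL form; the double-quoted literals in the original query rely on SQLite's lenient handling of quotes.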
I'm not sure I understand what you're asking when you refer to M4M3 and M4F4. Are you talking about the sample_1 and sample_2 values together? That is, selecting the rows where sample_1 = 'M4' and sample_2 = 'M3' and gene_id = 'M4F4'? I can't see how to get from the example dataset you provided to M4M3.new.
If you look at the structure of the data table, I want all rows with sample_1 = M4 and sample_2 = M3 that have no matching gene_id in the comparison sample_1 = M4, sample_2 = F4.
For your anti-join solution, you can simply set by = "gene_id".
You are correct, DMC. I try to be as explicit as possible, but it really isn't necessary here.