Common elements across R data frames
I have three data frames containing a lot of information, with the following row names:
ENSG00000000971 ENSG00000000971 ENSG00000000971
ENSG00000004139 ENSG00000004139 ENSG00000003987
ENSG00000005001 ENSG00000004848 ENSG00000004848
ENSG00000005102 ENSG00000002330 ENSG00000002330
ENSG00000005486 ENSG00000005102 ENSG00000006047
... ... ...
What I want is to find every item (row name) that appears in at least two of the data frames. That is, the final result should be a list like this:
ENSG00000000971
ENSG00000004139
ENSG00000004848
ENSG00000005102
ENSG00000002330
How can I do this? I tried the following:
shared.DESeq2.edgeR = data.frame(row.names(res.DESeq2) %in% row.names(res.edgeR))
shared.DESeq2.limma = data.frame(row.names(res.DESeq2) %in% row.names(res.limma))
shared.edgeR.limma = data.frame(row.names(res.edgeR) %in% row.names(res.limma))
shared = merge(merge(shared.DESeq2.edgeR, shared.DESeq2.limma), shared.edgeR.limma)
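...where the three res.[DESeq2/edgeR/limma] objects are data frames, but this takes a very long time to run (I didn't even let it finish, so I don't know whether it actually works). I have some code that does this for elements common to all three data frames, but I'm also interested in elements shared by two or more of them, and I can't find a good way to do that. Any ideas?

Try this example: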
#dummy data, with real data we would do: res.DESeq2_rn <-row.names(res.DESeq2)
res.DESeq2_rn <- letters[1:4]
res.edgeR_rn <- letters[3:8]
res.limma_rn <- letters[c(1,3,8,10)]
#get counts
res <- table(c(res.DESeq2_rn, res.edgeR_rn, res.limma_rn))
res
# a b c d e f g h j
# 2 1 3 2 1 1 1 2 1
#result
names(res)[ res>=2 ]
#[1] "a" "c" "d" "h"
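Applied to actual data frames, the same counting idea might look like the sketch below. The three toy data frames are hypothetical stand-ins built from the row names shown in the question, and the column `x` and the object names `ids`, `counts` and `shared` are made up for illustration.

```r
# Hypothetical stand-ins for the real result tables; only the row names matter here
res.DESeq2 <- data.frame(x = 1:5, row.names = c("ENSG00000000971", "ENSG00000004139",
                                                "ENSG00000005001", "ENSG00000005102",
                                                "ENSG00000005486"))
res.edgeR  <- data.frame(x = 1:5, row.names = c("ENSG00000000971", "ENSG00000004139",
                                                "ENSG00000004848", "ENSG00000002330",
                                                "ENSG00000005102"))
res.limma  <- data.frame(x = 1:5, row.names = c("ENSG00000000971", "ENSG00000003987",
                                                "ENSG00000004848", "ENSG00000002330",
                                                "ENSG00000006047"))

# Collect the row names of every data frame in one list
ids <- lapply(list(res.DESeq2, res.edgeR, res.limma), row.names)

# Count how many data frames each row name occurs in
counts <- table(unlist(ids))

# Keep the names seen in at least two data frames
shared <- names(counts)[counts >= 2]
shared
```

Changing the threshold to `counts == 3` would give the names common to all three data frames instead.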
Another approach, using @zx8754's sample data:
# dummy data
res.DESeq2 <- letters[ 1:4 ]
res.edgeR <- letters[ 3:8 ]
res.limma <- letters[ c( 1, 3, 8, 10 ) ]
# combine into one vector
res <- c( res.DESeq2, res.edgeR, res.limma )
res
[1] "a" "b" "c" "d" "c" "d" "e" "f" "g" "h" "a" "c" "h" "j"
# result
unique( res[ which( duplicated( res ) ) ] )
[1] "c" "d" "a" "h"
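Both approaches count occurrences in the combined vector, so they rely on no single input repeating a name. If that cannot be guaranteed, one option is to deduplicate each input before combining. The sketch below uses made-up vectors `v1`, `v2` and `v3` to show the difference.

```r
# Hypothetical inputs: "b" is repeated inside v1 but appears in no other vector
v1 <- c("a", "b", "b")
v2 <- c("c", "d")
v3 <- c("a", "e")

# Without deduplication, "b" is reported although only one vector contains it
raw <- c(v1, v2, v3)
naive <- unique(raw[duplicated(raw)])

# Deduplicate each vector first, then combine and look for repeats
clean <- unlist(lapply(list(v1, v2, v3), unique))
safe <- unique(clean[duplicated(clean)])

naive  # "b" "a"
safe   # "a"
```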
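Do any of the data frames contain duplicates?
No, none of the data frames has duplicated row names.
Yes, benchmarking shows your approach is the fastest. See my edited post.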
# create a large random character vector (this takes a lot of time!)
res <- rep( "x", 1000000 )
for( i in 1:1000000)
res[ i ] <- paste( sample( letters, 8, replace = TRUE ), collapse = "" )
head( res )
[1] "vsvkljgr" "ulxhqnas" "upqqtrdk" "pynuaihp" "srjtnvqm" "mxnlytvd"
# vaettchen:
system.time( x <- unique( res[ which( duplicated( res ) ) ] ) )
user system elapsed
0.173 0.000 0.171
x
[1] "zlzlwinb" "wielycpx"
# zx8754
system.time( { y <- table( res ); z <- names( y )[ y >= 2 ] } )
user system elapsed
18.945 0.020 19.058
z
[1] "wielycpx" "zlzlwinb"
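Building the test vector in a loop is the slow part of the set-up. A vectorised sketch that may be quicker is shown below; it is an assumption of mine, not part of the original benchmark, and the names `n`, `m` and `res_fast` as well as the smaller size are chosen only for illustration.

```r
set.seed(1)   # make the illustration reproducible
n <- 10000    # smaller than the benchmark's 1e6, just for a quick check

# Draw all letters at once and arrange them as n rows of 8 characters
m <- matrix(sample(letters, n * 8, replace = TRUE), ncol = 8)

# Paste the 8 columns together element-wise to get n strings of length 8
res_fast <- do.call(paste0, lapply(seq_len(ncol(m)), function(j) m[, j]))

head(res_fast)
```

The resulting vector can be passed to either timing call in place of `res`.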