R: subset the columns of one data frame based on the rows of another data frame
I want to subset some columns of a data frame based on the rows of another data frame. The two data frames look like this:
df1 <- structure(list(ID = structure(c(3L, 1L, 2L, 5L, 4L), .Label = c("cg08", "cg09", "cg29", "cg36", "cg65"), class = "factor"), chr = c(16L, 3L, 3L, 1L, 8L), gene = c(534L, 376L, 171L, 911L, 422L), GS12 = c(0.15, 0.87, 0.6, 0.1, 0.72), GS32 = c(0.44, 0.93, 0.92, 0.07, 0.91), GS56 = c(0.46, 0.92, 0.62, 0.06, 0.87), GS87 = c(0.79, 0.93, 0.86, 0.08, 0.88)), .Names = c("ID", "chr", "gene", "GS12", "GS32", "GS56", "GS87"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
df2 <- structure(list(samples = structure(c(1L, 2L, 4L, 3L, 6L, 5L), .Label = c("GS32", "GS33", "GS55", "GS56", "GS68", "GS87"), class = "factor"), ID2 = structure(c(1L, 6L, 3L, 4L, 5L, 2L), .Label = c("GM1", "GM10", "GM17", "GM18", "GM19", "GM7"), class = "factor")), .Names = c("samples", "ID2" ), class = "data.frame", row.names = c(NA, -6L))
I want to keep every column of df1 whose name also appears in the samples column of df2, while retaining all rows in the final output. In short, I want to subset the columns of one data frame based on the rows of another. Is there a function that does this?
You are checking which column names of df1 occur in df2$samples:
df1[colnames(df1) %in% df2$samples]
# GS32 GS56 GS87
#1 0.44 0.46 0.79
#2 0.93 0.92 0.93
#3 0.92 0.62 0.86
#4 0.07 0.06 0.08
#5 0.91 0.87 0.88
However, I assume you also want ID, chr, and gene in the output data frame, which can be done as follows:
df1[c(1:3, which(colnames(df1) %in% df2$samples))]
# ID chr gene GS32 GS56 GS87
#1 cg29 16 534 0.44 0.46 0.79
#2 cg08 3 376 0.93 0.92 0.93
#3 cg09 3 171 0.92 0.62 0.86
#4 cg65 1 911 0.07 0.06 0.08
#5 cg36 8 422 0.91 0.87 0.88
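A side note on building the index vector: when a logical vector is combined with numbers inside c(), R coerces TRUE to 1 and FALSE to 0, so the logical no longer acts as a mask. Passing it through which() turns it into column positions first. The following is a minimal sketch on a small hypothetical data frame (toy and wanted are made-up names, not part of the question's data):

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(ID = c("a", "b"), chr = 1:2,
                  GS1 = c(0.1, 0.2), GS2 = c(0.3, 0.4))
wanted <- c("GS2", "GS9")

# Logical mask over the column names: FALSE FALSE FALSE TRUE
mask <- colnames(toy) %in% wanted

# Mixing numbers with a logical vector coerces TRUE/FALSE to 1/0,
# so this requests columns 1, 2, 0, 0, 0, 1 (zeros are ignored, column 1 repeats)
bad_idx <- c(1:2, mask)

# which() converts the mask to positions before it is combined with 1:2
good_idx <- c(1:2, which(mask))

names(toy[good_idx])
```

With good_idx the selection contains ID, chr and GS2, whereas bad_idx would return the first column a second time instead of GS2.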
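If you want to force the column order to follow df2$samples, use match instead of %in%. match takes at least two arguments: the first is the vector of values to look up, and the second is the vector to look them up in.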
df1[,c(1:3,na.omit(match(df2$samples, colnames(df1))))]
# ID chr gene GS32 GS56 GS87
#1 cg29 16 534 0.44 0.46 0.79
#2 cg08 3 376 0.93 0.92 0.93
#3 cg09 3 171 0.92 0.62 0.86
#4 cg65 1 911 0.07 0.06 0.08
#5 cg36 8 422 0.91 0.87 0.88
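To see how match determines the order, here is a small sketch with hypothetical vectors (cols and wanted are made up for illustration and only stand in for colnames(df1) and df2$samples). The positions come back in the order of the first argument, and values without a match give NA, which is why na.omit is applied before indexing:

```r
cols   <- c("ID", "GS12", "GS32", "GS56")   # stands in for colnames(df1)
wanted <- c("GS56", "GS99", "GS32")         # stands in for df2$samples

# Position of each wanted name within cols, NA where the name is absent
pos <- match(wanted, cols)
pos
# [1]  4 NA  3

# Drop the NA; as.vector() strips the attributes that na.omit() attaches
keep <- as.vector(na.omit(pos))
keep
# [1] 4 3

# The selected names follow the order of wanted, not the order of cols
cols[keep]
# [1] "GS56" "GS32"
```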
What is your expected result? Try df1[intersect(names(df1), df2$samples)]
If df2$samples is a factor, use as.character(df2$samples).
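The suggestion in this comment can be sketched as follows, again on hypothetical data (toy and samples are illustrative names, not the question's objects). Converting the factor with as.character makes sure the comparison is done on the labels themselves:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(ID = c("a", "b"),
                  GS32 = c(0.44, 0.93),
                  GS56 = c(0.46, 0.92))
samples <- factor(c("GS56", "GS33", "GS32"))

# intersect() keeps the order of its first argument, here the column names of toy
common <- intersect(names(toy), as.character(samples))
common

# Select only the shared columns; all rows are kept
toy[common]
```

Note that this variant returns only the shared sample columns, in the order they have in the data frame, so the leading ID columns would need to be added back separately if they are wanted.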
I would look into the data.table package and its foverlaps function. The answer I was given to a similar question may also help you.