Column with specific values in R
I have a data frame with 3 columns (gene, variant_type and sample) and another one with two columns (pathway and genes). In the second one, each pathway has its list of genes. I now want to create a new data frame with 4 columns (gene, variant type, sample and pathways) showing the one or more pathways in which each gene is present. Can anyone help me? Thanks in advance.

I would like to end up with something like this:
structure(list(Hugo_Symbol = c("ZAP70", "TTN", "TTN", "PRKCD",
"PIK3CA", "TLR3"), Variant_Type = c("SNP", "SNP", "SNP", "SNP",
"SNP", "SNP"), Tumor_Sample_Barcode = c("TCGA-E9-A1RC-01A-11D-A159-09",
"TCGA-E9-A1RC-01A-11D-A159-09", "TCGA-E9-A1RC-01A-11D-A159-09",
"TCGA-E9-A1RC-01A-11D-A159-09", "TCGA-E9-A1RC-01A-11D-A159-09",
"TCGA-E9-A1RC-01A-11D-A159-09"), Pathways = c("hsa04014__44, hsa04014__33, hsa04014__37, hsa04014__24",
"hsa04530__11 20 16", "hsa04530__11 20 16", "hsa04722__37, hsa04722__35, hsa04722__33",
"hsa04151__25, hsa04151__37, hsa04151__73", "hsa04620__23")), row.names = c("6",
"8", "9", "11", "13", "16"), class = "data.frame")
Update: I changed the approach so that it also handles the case the OP pointed out. That is, if Hugo_Symbol is NF1, the logic should not match NF11 or NF12.
library(dplyr)
library(tidyr)

df1 %>%
  mutate(Hugo_Symbol = as.character(Hugo_Symbol)) %>%  # convert factor to character variable
  left_join(df2 %>%
              separate_rows(mutated, sep = ","),       # one row per gene in each pathway
            by = c("Hugo_Symbol" = "mutated")) %>%     # exact match on the whole gene symbol
  group_by(Hugo_Symbol, Variant_Type, Tumor_Sample_Barcode) %>%
  summarise(Pathways = paste(unique(circuit_names), collapse = ","))  # combine distinct values in Pathways
which gives
Hugo_Symbol Variant_Type Tumor_Sample_Barcode Pathways
1 NF1 SNP TCGA-E9-A1RC-01A-11D-A159-09 hsa04014__44,hsa04014__33
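To illustrate why splitting the gene lists matters, here is a minimal base R sketch that contrasts substring matching with matching on whole gene symbols. The vector `genes` is made-up illustration data, and the `grepl()` call is only an assumption about how a substring-style match would behave; it is not taken from the earlier version of the answer.

```r
# Hypothetical gene lists, one comma-separated string per pathway
genes <- c("ZAP70,NF1,MAPK1", "ZAP70,NF11,AKT3", "ZAP70,NF12,CSF1R")

# Substring matching: "NF1" is also found inside "NF11" and "NF12"
substring_hits <- grepl("NF1", genes, fixed = TRUE)

# Exact matching: split each string into gene symbols, then compare whole symbols
exact_hits <- vapply(
  strsplit(genes, ",", fixed = TRUE),
  function(symbols) "NF1" %in% symbols,
  logical(1)
)

print(substring_hits)  # TRUE TRUE TRUE
print(exact_hits)      # TRUE FALSE FALSE
```

Joining on the separated rows, as in the answer above, has the same effect as the second variant: only complete symbols are compared.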
Sample data:
df1 <- structure(list(Hugo_Symbol = "NF1", Variant_Type = "SNP", Tumor_Sample_Barcode = "TCGA-E9-A1RC-01A-11D-A159-09"), .Names = c("Hugo_Symbol",
"Variant_Type", "Tumor_Sample_Barcode"), class = "data.frame", row.names = "1")
df2 <- structure(list(circuit_names = c("hsa04014__44", "hsa04014__33",
"hsa04014__37", "hsa04014__24"), mutated = c("ZAP70,NF1,MAPK1,RAF1,CSF1R,RASGRP1,MAP2K1",
"ZAP70,NF1,AKT3,CSF1R,BAD,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1",
"ZAP70,NF11,AKT3,CSF1R,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1,RASGRF",
"ZAP70,NF12,CSF1R,RGL2,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1"
)), .Names = c("circuit_names", "mutated"), class = "data.frame", row.names = c("1",
"2", "3", "4"))
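For readers who would rather not depend on dplyr and tidyr, the same idea can be sketched in base R. This is an alternative technique, not the answer's method, and it uses a shortened, hypothetical copy of the sample data so that it runs on its own; the helper names `gene_lists`, `long` and `pathways_by_gene` are illustrative.

```r
# Shortened, hypothetical copy of the sample data
df1 <- data.frame(Hugo_Symbol = "NF1", Variant_Type = "SNP",
                  Tumor_Sample_Barcode = "TCGA-E9-A1RC-01A-11D-A159-09",
                  stringsAsFactors = FALSE)
df2 <- data.frame(circuit_names = c("hsa04014__44", "hsa04014__33", "hsa04014__37"),
                  mutated = c("ZAP70,NF1,MAPK1", "ZAP70,NF1,AKT3", "ZAP70,NF11,AKT3"),
                  stringsAsFactors = FALSE)

# Expand df2 to one row per (pathway, gene) pair
gene_lists <- strsplit(df2$mutated, ",", fixed = TRUE)
long <- data.frame(circuit_names = rep(df2$circuit_names, lengths(gene_lists)),
                   Hugo_Symbol = unlist(gene_lists),
                   stringsAsFactors = FALSE)

# For every gene, collapse its distinct pathways into one string
pathways_by_gene <- vapply(
  split(long$circuit_names, long$Hugo_Symbol),
  function(p) paste(unique(p), collapse = ","),
  character(1)
)

# Look up each gene by its full symbol; genes without a pathway get NA
df1$Pathways <- unname(pathways_by_gene[as.character(df1$Hugo_Symbol)])
print(df1)
```

Because the lookup is by name, NF1 picks up only the two pathways that list NF1 itself, not the one that lists NF11.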
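The format of your data is not very clear. Did you import it into R? If so, just use dput to share the relevant part.
I added the dput() output.
Sorry, where? You can run dput(head(df)) and paste the output here.
Thank you very much. You really helped me!
Hi Prem, I realized I have a problem with your function. For example, if I have MAP3K1 (Hugo_Symbol), the function also matches MAP3K11, MAP3K14 or AMAP3K11, so the new column is incorrect. Can you help me?
Hi Prem, thank you, but something is not right. This is the warning message: Column Hugo_Symbol/mutated joining factor and character vector, coercing into character vector
Yes, but there is another problem: now "Pathways" contains duplicated or triplicated pathways.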
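The last two comments correspond to the `as.character()` and `unique()` calls in the answer. A minimal base R sketch of both steps, using made-up vectors (`symbols` and `pathways` are illustrative, not taken from the OP's data):

```r
# A factor column, as older versions of R create by default when importing text
symbols <- factor(c("NF1", "TTN"))

# Converting to character first avoids the factor/character coercion warning in the join
symbols_chr <- as.character(symbols)

# The same pathway can appear more than once when a gene is listed repeatedly
pathways <- c("hsa04014__44", "hsa04014__33", "hsa04014__44")

with_duplicates <- paste(pathways, collapse = ",")
without_duplicates <- paste(unique(pathways), collapse = ",")

print(with_duplicates)     # "hsa04014__44,hsa04014__33,hsa04014__44"
print(without_duplicates)  # "hsa04014__44,hsa04014__33"
```

`unique()` keeps the first occurrence of each value, so the pathways stay in their original order.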