Column with a specific value in R

Tags: r, dataframe, filtering

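I have one data frame with 3 columns (gene, variant type, and sample) and another with two columns (pathway and genes). The second one lists the genes belonging to each pathway. I now want to create a new data frame with 4 columns (gene, variant type, sample, and pathways) that shows the pathway or pathways in which each gene is present. Can anyone help me? Thanks in advance.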

1)

2)

3) I would like something like this:

structure(list(Hugo_Symbol = c("ZAP70", "TTN", "TTN", "PRKCD", 
"PIK3CA", "TLR3"), Variant_Type = c("SNP", "SNP", "SNP", "SNP", 
"SNP", "SNP"), Tumor_Sample_Barcode = c("TCGA-E9-A1RC-01A-11D-A159-09", 
"TCGA-E9-A1RC-01A-11D-A159-09", "TCGA-E9-A1RC-01A-11D-A159-09", 
"TCGA-E9-A1RC-01A-11D-A159-09", "TCGA-E9-A1RC-01A-11D-A159-09", 
"TCGA-E9-A1RC-01A-11D-A159-09"), Pathways = c("hsa04014__44, hsa04014__33, hsa04014__37, hsa04014__24", 
"hsa04530__11 20 16", "hsa04530__11 20 16", "hsa04722__37, hsa04722__35, hsa04722__33", 
"hsa04151__25, hsa04151__37, hsa04151__73", "hsa04620__23")), row.names = c("6", 
"8", "9", "11", "13", "16"), class = "data.frame")

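Update: changed the solution approach so that it also handles the case pointed out by the OP. That is, if Hugo_Symbol is NF1, the logic should not match NF11 or NF12.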
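A small base R illustration of why exact matching matters here. A substring search such as grepl() treats NF1 as a match for NF11 and NF12, whereas splitting the comma-separated string and comparing whole gene symbols does not. The vector `genes` is made up for this demonstration and is not part of the original data.

```r
# Hypothetical comma-separated gene lists, one string per pathway
genes <- c("ZAP70,NF1,MAPK1", "ZAP70,NF11,AKT3", "ZAP70,NF12,CSF1R")

# Substring search: "NF1" is also found inside "NF11" and "NF12"
substring_hit <- grepl("NF1", genes, fixed = TRUE)

# Exact matching: split each string on commas, then compare whole tokens
tokens <- strsplit(genes, ",", fixed = TRUE)
exact_hit <- vapply(tokens, function(x) "NF1" %in% x, logical(1))

print(substring_hit)  # TRUE TRUE TRUE
print(exact_hit)      # TRUE FALSE FALSE
```

Splitting the gene lists into one row per gene and then joining on the gene symbol applies the same whole-token comparison to every row at once.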

library(dplyr)
library(tidyr)

df1  %>%
  mutate(Hugo_Symbol = as.character(Hugo_Symbol)) %>%   #convert factor to character variable
  left_join(df2 %>%
              separate_rows(mutated, sep = ','), 
            by = c("Hugo_Symbol" = "mutated")) %>%
  group_by(Hugo_Symbol, Variant_Type, Tumor_Sample_Barcode) %>%
  summarise(Pathways = paste(unique(circuit_names), collapse = ","))   #combine distinct values in Pathways

  Hugo_Symbol Variant_Type Tumor_Sample_Barcode         Pathways                                     
1 NF1         SNP          TCGA-E9-A1RC-01A-11D-A159-09 hsa04014__44,hsa04014__33
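If dplyr and tidyr are not available, roughly the same result can be sketched in base R. This is an alternative, not the method used in the answer. It assumes the column names Hugo_Symbol, circuit_names and mutated, and it builds small stand-in copies of df1 and df2 so that the block runs on its own.

```r
# Small stand-ins for df1 and df2 (assumed column names, shortened gene lists)
df1 <- data.frame(Hugo_Symbol = "NF1", Variant_Type = "SNP",
                  Tumor_Sample_Barcode = "TCGA-E9-A1RC-01A-11D-A159-09",
                  stringsAsFactors = FALSE)
df2 <- data.frame(circuit_names = c("hsa04014__44", "hsa04014__33", "hsa04014__37"),
                  mutated = c("ZAP70,NF1,MAPK1", "ZAP70,NF1,AKT3", "ZAP70,NF11,AKT3"),
                  stringsAsFactors = FALSE)

# Split each comma-separated gene list into whole gene symbols
gene_sets <- strsplit(df2$mutated, ",", fixed = TRUE)

# For every gene in df1, collect the pathways whose gene list contains it exactly
df1$Pathways <- vapply(as.character(df1$Hugo_Symbol), function(g) {
  hit <- vapply(gene_sets, function(s) g %in% s, logical(1))
  paste(unique(df2$circuit_names[hit]), collapse = ",")
}, character(1), USE.NAMES = FALSE)

print(df1$Pathways)  # "hsa04014__44,hsa04014__33"
```

In this sketch a gene that appears in no pathway gets an empty string in Pathways, and rows of df1 are kept as they are rather than being collapsed per group.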

Sample data:

df1 <- structure(list(Hugo_Symbol = "NF1", Variant_Type = "SNP", Tumor_Sample_Barcode = "TCGA-E9-A1RC-01A-11D-A159-09"), .Names = c("Hugo_Symbol", 
"Variant_Type", "Tumor_Sample_Barcode"), class = "data.frame", row.names = "1")

df2 <- structure(list(circuit_names = c("hsa04014__44", "hsa04014__33", 
"hsa04014__37", "hsa04014__24"), mutated = c("ZAP70,NF1,MAPK1,RAF1,CSF1R,RASGRP1,MAP2K1", 
"ZAP70,NF1,AKT3,CSF1R,BAD,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1", 
"ZAP70,NF11,AKT3,CSF1R,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1,RASGRF", 
"ZAP70,NF12,CSF1R,RGL2,RASGRP1,RASGRF1,RASGRF1,RASGRF1,RASGRF1"
)), .Names = c("circuit_names", "mutated"), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

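Your data format is not very clear. Did you import it into R? If so, just use dput to share the relevant part.

Add the dput() output.

Sorry, where?

You can run dput(head(df)) and paste the output here.

Thank you very much. You really helped me!

Hi Prem, I realized I have a problem with your function. For example, if I have MAP3K1 (Hugo_Symbol), the function also matches MAP3K11, MAP3K14 or AMAP3K11, so the new column will be incorrect. Can you help me?

Hi Prem, thank you, but something is still wrong. This is the warning message: Column Hugo_Symbol/mutated joining factor and character vector, coercing into character vector

Yes, but there is another problem... now there are duplicated or triplicated pathways in "Pathways".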