R: Add a count column to a data frame by counting occurrences of strings from another data frame in a comma-separated column
I have a data frame df1:
df1 <- structure(list(Id = c(0, 1, 3, 4), Support = c(17, 15, 10, 18
), Genes = structure(c(3L, 1L, 4L, 2L), .Label = c("BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1",
"CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4", "FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B",
"MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
How can I create a new column in df1 by counting how many of the strings in df2 occur in the Genes column, so that I get the desired output below?
Id | Support | Genes | Counts |
---------------------------------------------------------
0 | 17 |FOS,BCL2,... | 2 |
1 | 15 |BMP2,TGFB1,...| 3 |
3 | 10 |MAPK12,YWHAE..| 1 |
4 | 18 |CBLC,TGFA,... | 4 |
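There may be a more elegant solution, but this does the job:
# df is presumably the question's df1; df2$V1 holds the strings to look for
df$Counts <- unlist(lapply(df$Genes, function(x){
  # split the comma-separated gene list into individual names
  xx <- unlist(strsplit(as.character(x), split = ","))
  # count how many entries of df2$V1 appear in this row's gene list
  sum(df2$V1 %in% xx)
}))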
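The split-and-match idea above can be sketched as a self-contained script. The contents of df2 are not shown in the question, so the lookup values below are hypothetical, chosen only for illustration; the helper name count_hits is also an assumption. The sketch builds df1 with character columns (so no factor conversion is needed) and uses vapply so the result is guaranteed to be one integer per row.

```r
# df1 as in the question, but with Genes stored as character instead of factor
df1 <- data.frame(
  Id = c(0, 1, 3, 4),
  Support = c(17, 15, 10, 18),
  Genes = c("FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B",
            "BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1",
            "MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD",
            "CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL lookup table: the real df2 is not shown in the question
df2 <- data.frame(
  V1 = c("FOS", "BCL2", "MAPK12", "TGFB1", "BMP2", "YWHAE", "CBLC", "TGFA"),
  stringsAsFactors = FALSE
)

# Count how many distinct lookup strings appear as whole tokens in one gene list
count_hits <- function(genes, lookup) {
  tokens <- trimws(strsplit(as.character(genes), ",", fixed = TRUE)[[1]])
  sum(unique(lookup) %in% tokens)
}

df1$Counts <- vapply(df1$Genes, count_hits, integer(1),
                     lookup = df2$V1, USE.NAMES = FALSE)
print(df1)
```

With this made-up df2 the Counts column comes out as 2, 3, 2, 4. Because the comparison is done on whole tokens after splitting, a lookup value only counts when it equals a gene name exactly.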
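(I think the count for the third row in your example above should be 2 rather than 1?)
Here is another option using the stringr library. It loops over the Genes column of df and uses the values in df2 as patterns: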
#convert factors columns into characters
df$Genes<-as.character(df$Genes)
df2$V1<-as.character(df2$V1)
library(stringr)
#loop over the strings against the pattern from df2
df$Counts<-sapply(df$Genes, function(x){
sum(str_count(x, df2$V1))
})
df
Id Support Genes Counts
1 0 17 FOS,BCL2,PIK3CD,NFKBIA,TNFRSF10B 2
2 1 15 BMP2,TGFB1,BMP3,MAPK12,GDF11,MAPK13,CITED1 3
3 3 10 MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD 2
4 4 18 CBLC,TGFA,MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4 4
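One caveat with pattern-based counting: str_count treats each pattern as a regular expression by default and counts every substring match, so a short lookup value can match inside a longer gene name. The sketch below illustrates the difference in base R, without stringr; the lookup values and the function names count_substring and count_exact are hypothetical and chosen only to trigger the effect.

```r
genes <- "MAPK12,YWHAE,YWHAQ,MAPK13,SPRY4,PIK3CD"
patterns <- c("MAPK1", "YWHAE")  # hypothetical lookup values

# Substring counting: every occurrence of each pattern anywhere in the string
count_substring <- function(x, pats) {
  sum(vapply(pats, function(p) {
    m <- gregexpr(p, x, fixed = TRUE)[[1]]
    sum(m > 0)  # gregexpr returns -1 when there is no match
  }, integer(1)))
}

# Exact-token counting: a pattern counts only if it equals a whole gene name
count_exact <- function(x, pats) {
  tokens <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  sum(pats %in% tokens)
}

substring_count <- count_substring(genes, patterns)  # "MAPK1" hits MAPK12 and MAPK13
exact_count <- count_exact(genes, patterns)          # only "YWHAE" is a whole token
print(c(substring = substring_count, exact = exact_count))
```

Here the substring count is 3 while the exact-token count is 1. If the lookup values are always complete gene names that are never prefixes of other names, the two approaches agree; otherwise splitting on commas and matching whole tokens is the safer choice.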
You are right, that was a typo on my part. Thank you for your answer!