(tidyverse) Extract a substring for every string in a column, by group, given the positions to include
Tags: r, tidyverse

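I have a data frame of DNA alignments. Each alignment has a label and can consist of 3 or more isolates. My goal is to mutate the alignment column so that, within each alignment, every position where isolates 1, 3 and 4 all have a gap (denoted by a "-" sign) is removed. Isolates 1, 3 and 4 are always present in every alignment, and sometimes they are the only three in it.
What I have: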

    test_df <- data.frame(
      isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
      label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
      alignment = c("--atc-a", "at----a", "--ataga", "--attga",
                    "a---ggg", "acgttgg", "a---tgg", "a---tgg", "aggatgg")
    )

    > test_df
      isolate label alignment
    1       1     1   --atc-a
    2       2     1   at----a
    3       3     1   --ataga
    4       4     1   --attga
    5       1     2   a---ggg
    6       2     2   acgttgg
    7       3     2   a---tgg
    8       4     2   a---tgg
    9       5     2   aggatgg

What I want:

    > test_df
      isolate label alignment
    1       1     1     atc-a
    2       2     1     ----a
    3       3     1     ataga
    4       4     1     attga
    5       1     2      aggg
    6       2     2      atgg
    7       3     2      atgg
    8       4     2      atgg
    9       5     2      atgg
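To make the rule concrete, here is a minimal base R sketch for the first alignment only (label 1). It assumes the rows are in the order isolate 1, 2, 3, 4 and that all sequences have the same length. The names `aln`, `ref_rows`, `chars` and `all_gap` are illustrative and not part of the original post.

```r
# Alignment with label 1 from the example data (isolates 1, 2, 3, 4 in order).
aln <- c("--atc-a", "at----a", "--ataga", "--attga")
ref_rows <- c(1, 3, 4)  # rows holding the reference isolates 1, 3 and 4

# One row per sequence, one column per alignment position.
chars <- do.call(rbind, strsplit(aln, ""))

# A position is dropped only when every reference isolate has a gap there.
all_gap <- colSums(chars[ref_rows, , drop = FALSE] == "-") == length(ref_rows)
kept_positions <- which(!all_gap)

# Rebuild each sequence from the kept columns.
trimmed <- apply(chars[, kept_positions, drop = FALSE], 1, paste, collapse = "")
print(kept_positions)
print(trimmed)
```

Positions 1 and 2 are gaps in all three reference isolates, so they are dropped. Position 6 is a gap only in isolate 1, so it stays, which is why "--atc-a" becomes "atc-a" and "at----a" becomes "----a".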

What I've tried:

I can get a list of the sites I want to keep for each alignment, like so:

    library(tidyverse)
    library(stringr)

    test_df %>%
      mutate(positions = str_locate_all(alignment, "[^-]")) %>%
      group_by(label) %>%
      filter(isolate %in% c(1, 3, 4)) %>%
      summarise(pos_to_keep = list(unique(unlist(Reduce(rbind, positions)))))

But I'm not sure how to go on and subset all of the alignments.

Here is one way to get to your solution. There may be a faster way.

    library(dplyr)
    library(stringr)  # needed for str_locate_all()

    test_df <- data.frame(
      isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
      label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
      alignment = c("--atc-a", "at----a", "--ataga", "--attga",
                    "a---ggg", "acgttgg", "a---tgg", "a---tgg", "aggatgg"),
      stringsAsFactors = FALSE
    )

    # Get the correct positions: the union of non-gap positions
    # across isolates 1, 3 and 4 within each label
    labelGroups <- test_df %>%
      mutate(positions = str_locate_all(alignment, "[^-]")) %>%
      filter(isolate %in% c(1, 3, 4)) %>%
      group_by(label) %>%
      summarise(pos_to_keep = list(unique(sort(unlist(positions)))))

    # Make a function to extract the relevant letters
    getletters <- function(wordlist, indexlist) {
      n <- length(indexlist)
      lapply(1:n, function(i) {
        paste0(sapply(indexlist[[i]], function(x) substr(wordlist[i], x, x)),
               collapse = "")
      })
    }

    # Try it
    test_df %>%
      left_join(labelGroups, by = "label") %>%
      mutate(newAlignment = getletters(alignment, pos_to_keep))
    #   isolate label alignment   pos_to_keep newAlignment
    # 1       1     1   --atc-a 3, 4, 5, 6, 7        atc-a
    # 2       2     1   at----a 3, 4, 5, 6, 7        ----a
    # 3       3     1   --ataga 3, 4, 5, 6, 7        ataga
    # 4       4     1   --attga 3, 4, 5, 6, 7        attga
    # 5       1     2   a---ggg    1, 5, 6, 7         aggg
    # 6       2     2   acgttgg    1, 5, 6, 7         atgg
    # 7       3     2   a---tgg    1, 5, 6, 7         atgg
    # 8       4     2   a---tgg    1, 5, 6, 7         atgg
    # 9       5     2   aggatgg    1, 5, 6, 7         atgg
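As a possible alternative, the join and the list column can be avoided by doing the work inside a grouped `mutate()`. The following is only a sketch under the same assumptions as above (isolates 1, 3 and 4 are present in every label, and all sequences within a label have equal length). `keep_positions` is a hypothetical helper name, and base R's `gregexpr()` and `substring()` stand in for `str_locate_all()` and `substr()`.

```r
library(dplyr)

test_df <- data.frame(
  isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  alignment = c("--atc-a", "at----a", "--ataga", "--attga",
                "a---ggg", "acgttgg", "a---tgg", "a---tgg", "aggatgg"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the characters at positions `pos`
# in each string; substring() is vectorised over the positions.
keep_positions <- function(alignment, pos) {
  vapply(alignment,
         function(s) paste(substring(s, pos, pos), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

result <- test_df %>%
  group_by(label) %>%
  mutate(newAlignment = {
    # Non-gap positions in any of the reference isolates 1, 3 and 4.
    ref <- alignment[isolate %in% c(1, 3, 4)]
    pos <- sort(unique(unlist(gregexpr("[^-]", ref))))
    pos <- pos[pos > 0]  # gregexpr() returns -1 for an all-gap sequence
    keep_positions(alignment, pos)
  }) %>%
  ungroup()

print(result)
```

Here `newAlignment` is a plain character column rather than a list column, which may be easier to work with downstream. Whether it is actually faster on large alignments has not been measured.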

IMO it's a bit unclear how you get to the expected result. Could you explain in words how --atc-a goes to atc-a, and how at---a goes to --a, etc., please?

@elsherbini Awesome! Glad I could help.