R: combining strings whose headers share a common part
I have a file like this:
>mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p
ACUGUACAGGCCACUGCCUUGC
>mmu-let-7i-5p MIMAT0000122 Mus musculus let-7i-5p
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7i-3p MIMAT0004520 Mus musculus let-7i-3p
CUGCGCAAGCUACUGCCUUGCU
....
....
I want to combine the sequences whose headers share the same leading part (mmu-let-7g, mmu-let-7i, and so on).
Output:
>mmu-let-7g
UGAGGUAGUAGUUUGUACAGUU ACUGUACAGGCCACUGCCUUGC
>mmu-let-7i
UGAGGUAGUAGUUUGUGCUGUU CUGCGCAAGCUACUGCCUUGCU
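You can read the file with readLines, then remove from each element of "lines" the suffix part that begins with "-" (giving "lines1"). This removes only the suffix of the header lines. Create a TRUE/FALSE index ('indx') that marks the header lines. Use it to separate the header lines from the sequence lines, then apply an aggregating function (tapply) grouped by header to paste the sequence lines together. Rearranging "v1" into "v2" interleaves headers and sequences, which gives the expected result.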
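The first step is to read the multi-line format. The scan function allows this if you supply a list as the what argument (and use a named list) together with multi.line=TRUE. This works well for converting to a data frame: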
> dat <- as.data.frame( scan(what =list( V1="", V2="", V3="", V4="", V5="", V6=""), multi.line=TRUE) )
1: >mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p
1: UGAGGUAGUAGUUUGUACAGUU
2: >mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p
2: ACUGUACAGGCCACUGCCUUGC
3: >mmu-let-7i-5p MIMAT0000122 Mus musculus let-7i-5p
3: UGAGGUAGUAGUUUGUGCUGUU
4: >mmu-let-7i-3p MIMAT0004520 Mus musculus let-7i-3p
4: CUGCGCAAGCUACUGCCUUGCU
5:
Read 4 records
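The transcript above reads records typed at the console. As a minimal sketch, the same read can be run non-interactively by passing the lines through the text argument of scan. The variable name txt and the quiet = TRUE setting are choices made here for illustration and are not part of the original answer.

```r
# Sample records copied from the question; `txt` is an illustrative name.
txt <- c(
  ">mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p",
  "UGAGGUAGUAGUUUGUACAGUU",
  ">mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p",
  "ACUGUACAGGCCACUGCCUUGC",
  ">mmu-let-7i-5p MIMAT0000122 Mus musculus let-7i-5p",
  "UGAGGUAGUAGUUUGUGCUGUU",
  ">mmu-let-7i-3p MIMAT0004520 Mus musculus let-7i-3p",
  "CUGCGCAAGCUACUGCCUUGCU"
)

# Six named character fields per record: five tokens from the header line
# and the sequence from the following line. multi.line = TRUE lets a
# record continue onto the next line.
dat <- as.data.frame(
  scan(text = txt,
       what = list(V1 = "", V2 = "", V3 = "", V4 = "", V5 = "", V6 = ""),
       multi.line = TRUE, quiet = TRUE),
  stringsAsFactors = FALSE
)
dat[, c("V1", "V5", "V6")]
```

To read from a file instead, the file path would be passed as the first argument of scan in place of text = txt.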
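lines <- readLines('file.txt')   # 'file.txt' is a placeholder file name
# Cut each header at its last name suffix ("-5p", "-3p"). The exact pattern is
# an assumption; sequence lines do not match it and are left unchanged.
lines1 <- sub('-[^- ]+ .*$', '', lines)
# Join sequences that span several lines, so each header is followed by one line
lines2 <- unlist(lapply(split(lines1, cumsum(grepl('>', lines1))),
                        function(x) c(x[1], paste(x[-1], collapse=''))),
                 use.names=FALSE)
indx <- grepl('^>', lines2)      # TRUE for header lines
v1 <- tapply(lines2[!indx], lines2[indx], FUN=paste, collapse=' ')
v2 <- c(rbind(names(v1), unname(v1)))   # interleave headers and sequences
v2
#[1] ">mmu-let-7g"
#[2] "UGAGGUAGUAGUUUGUACAGUU ACUGUACAGGCCACUGCCUUGC"
#[3] ">mmu-let-7i"
#[4] "UGAGGUAGUAGUUUGUGCUGUU CUGCGCAAGCUACUGCCUUGCU"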
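To try the readLines/tapply approach on the sample records without a file, the steps can be wrapped in a function that takes the lines as a character vector. This is only a sketch: the name combine_by_header and the suffix pattern are assumptions made for illustration, and the last sequence is deliberately split across two lines to exercise the joining step.

```r
# Hypothetical helper; `lines` is the file content as a character vector.
combine_by_header <- function(lines) {
  # Assumed pattern: cut each header at its last "-xx" name suffix.
  lines1 <- sub("-[^- ]+ .*$", "", lines)
  # One group per record; join sequence lines that were wrapped.
  grp <- cumsum(grepl("^>", lines1))
  lines2 <- unlist(lapply(split(lines1, grp),
                          function(x) c(x[1], paste(x[-1], collapse = ""))),
                   use.names = FALSE)
  indx <- grepl("^>", lines2)  # TRUE for header lines
  # Paste together the sequences that share a stripped header.
  v1 <- tapply(lines2[!indx], lines2[indx], FUN = paste, collapse = " ")
  # Interleave headers and combined sequences.
  c(rbind(names(v1), as.vector(v1)))
}

sample_lines <- c(
  ">mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p",
  "UGAGGUAGUAGUUUGUACAGUU",
  ">mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p",
  "ACUGUACAGGCCACUGCCUUGC",
  ">mmu-let-7i-5p MIMAT0000122 Mus musculus let-7i-5p",
  "UGAGGUAGUAGUUUGUGCUGUU",
  ">mmu-let-7i-3p MIMAT0004520 Mus musculus let-7i-3p",
  "CUGCGCAAGCU",   # wrapped sequence, first part
  "ACUGCCUUGCU"    # wrapped sequence, second part
)

out <- combine_by_header(sample_lines)
writeLines(out)
```

Passing the result to writeLines with a file path as the second argument would save it in the same two-line-per-group layout as the requested output.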
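With dat built by the scan() call, paste the sequences (V6) together, grouped by the name in V5 with its trailing two-character suffix removed:
> tapply(dat$V6, sub("-..$","", dat$V5), paste, collapse=" ")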
let-7g
"UGAGGUAGUAGUUUGUACAGUU ACUGUACAGGCCACUGCCUUGC"
let-7i
"UGAGGUAGUAGUUUGUGCUGUU CUGCGCAAGCUACUGCCUUGCU"
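The result above is named let-7g and let-7i, not by the >mmu-let-7g style headers the question asks for. One possible way to get that layout is to group on V1 instead of V5 and then interleave names and values. The sketch below uses a small stand-in for dat holding only the two columns it needs; the names res and out are illustrative.

```r
# Stand-in for the `dat` produced by scan(); only the columns used here.
dat <- data.frame(
  V1 = c(">mmu-let-7g-5p", ">mmu-let-7g-3p",
         ">mmu-let-7i-5p", ">mmu-let-7i-3p"),
  V6 = c("UGAGGUAGUAGUUUGUACAGUU", "ACUGUACAGGCCACUGCCUUGC",
         "UGAGGUAGUAGUUUGUGCUGUU", "CUGCGCAAGCUACUGCCUUGCU"),
  stringsAsFactors = FALSE
)

# Group on the full header ID with its last "-xx" part removed,
# so the group names are already in the ">mmu-let-7g" form.
res <- tapply(dat$V6, sub("-[^-]+$", "", dat$V1), paste, collapse = " ")

# Interleave group names and combined sequences, one per line.
out <- c(rbind(names(res), as.vector(res)))
writeLines(out)
```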