Counting bases in a DNA sequence with the seqinr package in R
I have an array extracted from a fasta file:
> dat
[1] "t" "a" "t" "t" "t" "a" "c" "c" "g" "a" "c" "g" "a" "a" "a" "t" "t" "a" "a" "t" "a" "c" "c" "a" "t" "c" "a" "g" "g" "g" "t" "a" "t"
[34] "t" "a" "a" "g" "a" "t" "g" "c" "t" "a" "c" "c" "a" "a" "c" "g" "t" "g" "g" "t" "a" "t" "t" "a" "a" "a" "a" "t" "g" "t" "g" "c" "c"
[67] "c" "a" "a" "c" "c" "g" "c" "g" "a" "a" "a" "a" "a" "g" "a" "a" "a" "g" "t" "g" "g" "t" "a" "t" "a" "t" "a" "g" "g" "a" "a" "a" "a"
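The sequence is much longer, but that doesn't matter. I want to split the first 100000 characters of this array into intervals of length 1000 and count the number of "g" bases in each interval. So far I have tried: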
library(seqinr)
intervals = 1000*(0:99)
g_count = count(dat[intervals+1:intervals+1000], 1)[["g"]]
But this returns the error: numerical expression has 100 elements: only the first used
Thanks for any help.

To count the number of "g" bases in each interval, you can use this base R approach:
n <- 1000
result <- tapply(dat, ceiling(seq_along(dat)/n), function(x) sum(x == 'g'))
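For context on why the original attempt fails: in R the `:` operator binds more tightly than `+`, so `intervals+1:intervals+1000` is parsed as `intervals + (1:intervals) + 1000`, and `1:intervals` only uses the first element of `intervals`, which is what the message is complaining about. Below is a minimal sketch of the `tapply` idea restricted to the first 100000 bases. The real sequence isn't shown in the question, so `dat` here is a randomly generated stand-in, and the names `first` and `bin` are just illustrative.

```r
# Hypothetical stand-in for the real sequence (an assumption, not the asker's data).
set.seed(42)
dat <- sample(c("a", "c", "g", "t"), 120000, replace = TRUE)

n <- 1000
first <- head(dat, 100000)             # keep only the first 100000 bases
bin <- ceiling(seq_along(first) / n)   # 1 for positions 1..1000, 2 for 1001..2000, ...

# Sum the logical vector within each bin to get the number of "g" per interval.
g_count <- tapply(first == "g", bin, sum)

length(g_count)  # one count per interval
```

Summing `first == "g"` works because `TRUE` is treated as 1 and `FALSE` as 0, so no helper function is needed inside `tapply`.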
We can also use rowsum in base R:
rowsum(+(dat == 'g'), as.integer(gl(length(dat), n, length(dat))))
Data
dat <- c("t", "a", "t", "t", "t", "a", "c", "c", "g", "a", "c", "g",
"a", "a", "a", "t", "t", "a", "a", "t", "a", "c", "c", "a", "t",
"c", "a", "g", "g", "g", "t", "a", "t")
n <- 11
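As a quick check, the two approaches above can be run on this sample data with `n <- 11`, which splits the 33 bases into three intervals. This is only a sketch for comparison; the names `by_tapply` and `by_rowsum` are illustrative and not part of the original answers.

```r
dat <- c("t", "a", "t", "t", "t", "a", "c", "c", "g", "a", "c", "g",
         "a", "a", "a", "t", "t", "a", "a", "t", "a", "c", "c", "a", "t",
         "c", "a", "g", "g", "g", "t", "a", "t")
n <- 11

# tapply: group positions into consecutive blocks of n and count "g" in each.
by_tapply <- tapply(dat, ceiling(seq_along(dat) / n), function(x) sum(x == "g"))

# rowsum: turn the comparison into 0/1 and sum within the same blocks.
by_rowsum <- rowsum(+(dat == "g"), as.integer(gl(length(dat), n, length(dat))))

as.vector(by_tapply)  # 1 1 3
as.vector(by_rowsum)  # 1 1 3
```

Note that `tapply` returns a named one-dimensional array while `rowsum` returns a one-column matrix, so `as.vector` is used here to compare just the counts.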