Faster way to split a string and count characters using R?


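I'm looking for a faster way to calculate the GC content of DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times the letter "G" or "C" appears. I also want to specify the range of characters to consider. I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this: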

##
## count the number of GCs in the characters between start and stop
##
gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}

Any suggestions for making this code faster?

There's no need to use a loop here.

Try this:

gcCount <-  function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
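
A variant of the same vectorized idea, as a minimal sketch: the range is cut out with substr before splitting, so only the needed characters are split, and %in% replaces the two tolower() comparisons. The name gcCountIn is made up for this illustration and does not appear in the original answers.

```r
# gcCountIn is a hypothetical name used only for this sketch.
gcCountIn <- function(line, st, sp) {
  # Cut out the requested range first, then split it into single characters.
  chars <- strsplit(substr(as.character(line), st, sp), "")[[1]]
  # %in% yields a logical vector; sum() counts its TRUE entries.
  sum(chars %in% c("G", "C", "g", "c"))
}

gcCountIn("GCCCAAAATTTTCCGGggcc", 1, 20)  # 12
gcCountIn("GCCCAAAATTTTCCGGggcc", 5, 12)  # 0 (the range is "AAAATTTT")
```

Whether %in% or the | construction is faster probably depends on the input, so it is worth timing both on real data.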

A one-liner (a is the sequence string):
table(strsplit(toupper(a), '')[[1]])

Better not to split at all; just count the matches:
gcCount2 <-  function(line, st, sp){
  sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}


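That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.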
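
On the `> 0` comparison in gcCount2: gregexpr signals "no match" with a single -1 rather than an empty result, so counting with length() would report one match where there are none. A short sketch of that edge case, using a made-up input with no G or C:

```r
m <- gregexpr("[GCgc]", "AATT")[[1]]  # a range with no G or C

as.integer(m)  # -1: gregexpr's "no match" marker
length(m)      # 1, which would be miscounted as one match
sum(m > 0)     # 0, the correct count
```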
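
I don't know whether it's any faster, but you might want to look at the R package seqinR. It's an excellent, general-purpose bioinformatics package with many methods for sequence analysis. It's on CRAN (which seems to be down as I write this).
GC content would be:
library(seqinr)
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence)  # 0.4761905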
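
For comparison, the same fraction can be computed in base R. This is only a sketch of the arithmetic, not seqinr's implementation: gcFraction is a hypothetical helper, and GC() may treat ambiguous base codes differently.

```r
# gcFraction is a hypothetical helper, not part of seqinr.
gcFraction <- function(chars) {
  # chars: a character vector with one base per element, like the output of s2c().
  mean(tolower(chars) %in% c("g", "c"))
}

mysequence <- strsplit("agtctggggggccccttttaagtagatagatagctagtcgta", "")[[1]]
gcFraction(mysequence)
```

Here 20 of the 42 bases are g or c, so the result agrees with the 0.4761905 reported by GC().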

Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5

Or you can use the regex version to count both g and G:
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12

Or you can apply the tolower function first and then use stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
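
For readers without stringi installed, a similar vectorized count over several sequences can be sketched in base R. The seqs vector is invented for illustration, and no claim is made that either option matches stringi's speed.

```r
seqs <- c("GCCCAAAATTTTCCGG", "ggcc", "AATT")

# Option 1: collect the matches per string and count them.
# regmatches() gives character(0) for a string with no match, so lengths() reports 0.
counts_a <- lengths(regmatches(seqs, gregexpr("[GCgc]", seqs)))

# Option 2: delete everything that is not G/C and measure what is left.
counts_b <- nchar(gsub("[^GCgc]", "", seqs))

counts_a  # 8 4 0
counts_b  # 8 4 0
```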

Timing:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
                             expr     min     lq  median      uq     max neval
                gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
               gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
 stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100

Another example, for a longer string. stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"

As you can see, for longer sequences stri_count_regex pulls ahead :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
                             expr       min         lq     median        uq       max neval
          gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
         gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
 stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100
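
A comparison like this can also be sketched with base R alone, assuming microbenchmark and stringi are unavailable. The counter names below are hypothetical, strrep stands in for stri_dup, and system.time is far coarser than microbenchmark, so the numbers are only indicative.

```r
y <- strrep("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg", 100)

# Two base-R counters (hypothetical names) to compare.
countSplit <- function(s) sum(strsplit(s, "")[[1]] %in% c("G", "C", "g", "c"))
countRegex <- function(s) sum(gregexpr("[GCgc]", s)[[1]] > 0)

# Check agreement before timing; a fast wrong answer is not useful.
stopifnot(countSplit(y) == countRegex(y))

# Repeat each call so the elapsed time is measurable with system.time().
system.time(for (i in 1:200) countSplit(y))
system.time(for (i in 1:200) countRegex(y))
```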
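
Thanks to everyone for this post.
To optimize a script in which I want to calculate the GC content of 100M sequences of 200 bp each, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count brought it down to 1.5 hours.
In the end I coded it in C++ and called it using Rcpp, which cut the computation time down to 10 minutes!
Here is the C++ code: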
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
  int count = 0;

  for (int i = 0; i < s.size(); i++) 
    if (s[i] == 'G') count++;
    else if (s[i] == 'C') count++;

  float pGC = (float)count / s.size();
  pGC = pGC * 100;
  return pGC;
}

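Voted up. Much better than my answer (which should have been: do it in BioPerl :-)
Thanks. This is about 4x faster, which is just barely faster than the function I built from Rajarshi's code. You can tell I'm still learning R; it's hard to break out of the loop-centered thinking I've used for years.
Another thing you could try is: tolower(chars) %in% c("g", "c"). Not sure which is faster, though I suspect the OR | operator is faster than %in%.
Even better (~7x). Thanks!
An important addition to this function: note that if the substring contains no G/C, gregexpr returns -1, so a length-based count will not give 0; this needs to be checked.
Thanks dan12345. @user2265478 just suggested an edit to fix this, which I've incorporated (even though that edit was rejected [not by me]).
For what it's worth, I ended up deciding that R was too slow to process the ~3 billion base pairs of the human genome and used a small perl script instead.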


The C++ function pGC_cpp is then compiled and called from R like this:
library(Rcpp)
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")
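
To sanity-check the compiled function on a machine without a C++ toolchain, a pure-R counterpart can be sketched as follows. pGC_r is a hypothetical name; like the C++ loop, it counts only upper-case G and C, so lower-case input would need toupper() first.

```r
# pGC_r is a hypothetical pure-R counterpart of pGC_cpp, for cross-checking only.
pGC_r <- function(s) {
  codes <- utf8ToInt(s)                     # integer code point of each character
  gc <- sum(codes %in% utf8ToInt("GC"))     # upper-case G and C only, as in the C++ loop
  100 * gc / length(codes)                  # percentage of the whole string
}

pGC_r("ATGCCC")  # 66.66667 (4 of 6 bases)
```

The C++ version returns a float, so its result may differ from this one in the last few digits.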