C++: Fast counting of nucleotide types in a large number of sequences

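First, some background on my problem.
I am a bioinformatician, which means I use computing to try to answer biological questions. For this problem I have to process a so-called FASTA file, which looks like this: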

>Header 1  
ATGACTGATCGNTGACTGACTGTAGCTAGC  
>Header 2  
ATGCATGCTAGCTGACTGATCGTAGCTAGC  
ATCGATCGTAGCT

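So a FASTA file is basically just a header line, prefixed with a '>' character, followed by a sequence of nucleotides spread over one or more lines. A nucleotide is a character that can take one of 5 possible values: A, T, C, G or N.

What I want to do is count how many times each nucleotide type occurs. So if we consider this dummy FASTA file: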

>Header 1  
ATTCGN

I should get:

A:1 T:2 C:1 G:1 N:1

Here is what I have so far:

ifstream sequence_file(input_file.c_str());
string line;
string sequence = "";
map<char, double> nucleotide_counts;

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        nucleotide_counts['A'] = boost::count(sequence, 'A');
        nucleotide_counts['T'] = boost::count(sequence, 'T');
        nucleotide_counts['C'] = boost::count(sequence, 'C');
        nucleotide_counts['G'] = boost::count(sequence, 'G');
        nucleotide_counts['N'] = boost::count(sequence, 'N');
        sequence = "";
    }
}

Edit 2: Thanks to everyone in this thread, I got a speed-up of roughly 30x over the original boost solution. Here is the code:

#include <array>   // std::array
#include <fstream> // std::ifstream
#include <string>  // std::string

// Assumes the sequence contains only uppercase letters 'A'..'Z';
// any other character would index outside the array.
void count_nucleotides(std::array<double, 26> &nucleotide_counts, const std::string &sequence) {
    for(std::size_t i = 0; i < sequence.size(); i++) {
        ++nucleotide_counts[sequence[i] - 'A'];
    }
}

std::ifstream sequence_file(input_file.c_str());
std::string line;
std::string sequence = "";
std::array<double, 26> nucleotide_counts{}; // zero-initialized

while(getline(sequence_file, line)) {
    if(line[0] != '>') {
        sequence += line;
    }
    else {
        count_nucleotides(nucleotide_counts, sequence);
        sequence = "";
    }
}
// The last record has no header after it, so count it once the loop ends.
count_nucleotides(nucleotide_counts, sequence);

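It is slow because you either go through an indirection on every access or scan the same string 5 times.

You don't need a map: use 5 integers and increment them separately. That should be faster than the boost::count version, because you no longer traverse the string 5 times, and faster than incrementing through a map or unordered_map, because you avoid n indirect accesses.

So use something like this: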

switch(c) // c is the current character of the sequence
{
case 'A':
    ++a;
    break;
case 'G':
    ++g;
    break;
}
...
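
To make the fragment above concrete, here is a minimal sketch of a single pass over one sequence with five plain integer counters. The struct `Counts` and the function `count_nucleotides` are hypothetical names chosen for illustration, and the sketch assumes uppercase input; anything other than A, T, C, G or N is silently ignored.

```cpp
#include <cstdint>
#include <string>

// Hypothetical result type holding the five counters (illustrative name).
struct Counts {
    std::uint64_t a, t, c, g, n;
};

// Count every nucleotide in one pass over the sequence.
Counts count_nucleotides(const std::string& sequence) {
    // Plain local counters: no map lookup and no indirection per character.
    std::uint64_t a = 0, t = 0, c = 0, g = 0, n = 0;
    for (char ch : sequence) {
        switch (ch) {
            case 'A': ++a; break;
            case 'T': ++t; break;
            case 'C': ++c; break;
            case 'G': ++g; break;
            case 'N': ++n; break;
            default: break;  // assumption: other characters are ignored
        }
    }
    return Counts{a, t, c, g, n};
}
```

Calling `count_nucleotides("ATTCGN")` on the dummy sequence from the question gives A=1, T=2, C=1, G=1, N=1.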

As people suggested in the comments, try something like this:

#include <iostream>
#include <string>

enum eNucleotide {
    NucleotideA = 0,
    NucleotideT,
    NucleotideC,
    NucleotideG,
    NucleotideN,
    Size,
};

// Counts the nucleotides of a single line and prints the result.
// Header lines (starting with '>') are skipped.
void countSequence(const std::string &line)
{
    long nucleotide_counts[eNucleotide::Size] = { 0 };

    if(line[0] != '>') {
        for(std::size_t i = 0; i < line.size(); ++i)
        {
           switch (line[i])
           {
               case 'A':
                   ++nucleotide_counts[NucleotideA];
                   break;
               case 'T':
                   ++nucleotide_counts[NucleotideT];
                   break;
               case 'C':
                   ++nucleotide_counts[NucleotideC];
                   break;
               case 'G':
                   ++nucleotide_counts[NucleotideG];
                   break;
               case 'N':
                   ++nucleotide_counts[NucleotideN];
                   break;
               default :
                   /// error condition
                   break;
           }
        }

        /// print results
        std::cout << "A: " << nucleotide_counts[NucleotideA];
        std::cout << " T: " << nucleotide_counts[NucleotideT];
        std::cout << " C: " << nucleotide_counts[NucleotideC];
        std::cout << " G: " << nucleotide_counts[NucleotideG];
        std::cout << " N: " << nucleotide_counts[NucleotideN] << std::endl;
    }
}

If this is the main task you have to perform, you might be interested in an awk solution. Many problems involving FASTA files are easy to tackle with awk:
awk '/^>/ && c { for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" ; delete a }
    /^>/ {print; c++; next}
    { for(i=1;i<=length($0);++i) a[substr($0,i,1)]++ }
    END{ for(i in a) if (i ~ /[A-Z]/) printf i":"a[i]" "; print "" }' fastafile

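Note: I know this is not C++, but it is often useful to have other means of reaching the same goal.

Benchmarks using awk:

Test file:
Unzipped size: 2.3G
Total records: 5502947
Total lines:

Script 0: (runtime: too long) The script shown above is extremely slow. Use it only on small files.
Script 1: (runtime: 484.31 sec) This is an optimized version in which we do targeted counts: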
/^>/ && f { for(i in c) printf i":"c[i]" "; print "" ; delete c }
/^>/ {print; f++; next}
{   s=$0
    c["A"]+=gsub(/[aA]/,"",s)
    c["C"]+=gsub(/[cC]/,"",s)
    c["G"]+=gsub(/[gG]/,"",s)
    c["T"]+=gsub(/[tT]/,"",s)
    c["N"]+=gsub(/[nN]/,"",s)
}
END { for(i in c) printf i":"c[i]" "; print "" ; delete c }


Update 2: (runtime: 416.43 sec) Concatenate all sub-sequences of a record into a single string and count only once:
function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
/^>/ && f { count(); for(i in c) printf i":"c[i]" "; print "" ; delete c; string=""}
/^>/ {print; f++; next}
{ string=string $0 }
END { count(); for(i in c) printf i":"c[i]" "; print "" }

Update 3: (runtime: 396.12 sec) Optimize how awk finds its records and fields, and exploit this to process each record in one go:
function count() {
    c["A"]+=gsub(/[aA]/,"",string)
    c["C"]+=gsub(/[cC]/,"",string)
    c["G"]+=gsub(/[gG]/,"",string)
    c["T"]+=gsub(/[tT]/,"",string)
    c["N"]+=gsub(/[nN]/,"",string)
}
BEGIN{RS="\n>"; FS="\n"}
{
  print $1
  string=substr($0,length($1)); count()
  for(i in c) printf i":"c[i]" "; print ""
  delete c; string=""
}

Update 4: (runtime: 259.69 sec) Update the regex search in gsub. This gives a worthwhile speed-up:
function count() {
    n=length(string);
    gsub(/[aA]+/,"",string); m=length(string); c["A"]+=n-m; n=m
    gsub(/[cC]+/,"",string); m=length(string); c["C"]+=n-m; n=m
    gsub(/[gG]+/,"",string); m=length(string); c["G"]+=n-m; n=m
    gsub(/[tT]+/,"",string); m=length(string); c["T"]+=n-m; n=m
    gsub(/[nN]+/,"",string); m=length(string); c["N"]+=n-m; n=m
}
BEGIN{RS="\n>"; FS="\n"}
{
  print ">"$1
  string=substr($0,length($1)); count()
  for(i in c) printf i":"c[i]" "; print ""
  delete c; string=""
}

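If you need speed and can use an array, don't use a map. Also, std::getline can take a custom delimiter (instead of \n).

std::ifstream sequence_file(input_file.c_str());
std::string sequence;
std::array<int, 26> nucleotide_counts{};  // zero-initialized

// For one sequence: read up to the next '>' (the chunk begins with the header line)
std::getline(sequence_file, sequence, '>');
const std::size_t header_end = sequence.find('\n');  // skip the header line
for (std::size_t i = (header_end == std::string::npos) ? sequence.size() : header_end + 1; i < sequence.size(); ++i) {
    const char c = sequence[i];
    if (c >= 'A' && c <= 'Z') ++nucleotide_counts[c - 'A'];  // ignore '\n' and anything else
}

// nucleotide_counts['X' - 'A'] contains the count of nucleotide X in the sequence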

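In order of importance:

1. Good code for this task will be 100% I/O-bound. Your processor can count characters much faster than your disk can feed them to the CPU. So the first question I would ask is: what is the throughput of your storage medium? What are your ideal RAM and cache throughputs? Those are the upper limits. If you have hit them, there is little point in looking at your code any further. Your boost solution may already be there.

2. std::map lookups are relatively expensive. Yes, they are O(log(N)), but your N=5 is small and constant, so that tells you nothing. For 5 values, the map has to chase about three pointers per lookup (not to mention how hard that is on the branch predictor). Your count solution does 5 map lookups and 5 traversals per string, whereas your hand-written solution does one map lookup per nucleotide (but only one traversal of the string).

Serious suggestion: use a local variable for each counter. These will almost certainly be placed in CPU registers and are therefore basically free. With local variables you will never pollute your caches with the counters, unlike with map, un…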