Perl: searching for a pattern across array elements


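I'm new to Perl and have run into another bioinformatics problem that I'd like help and input on.

Problem summary:

  • I have a file with more than 40,000 unique DNA sequences. By unique I mean unique sequence IDs. I've attached part of it at the end of this post to show what it looks like.

  • I need to split each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts has 333 characters.

  • I need to search each of the 3 separate parts for the following pattern: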

    $gpat = '[G]{3,5}'; $npat = '[A-Z]{1,25}';
    $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$npat.$gpat;

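  • If $pattern occurs in the first of the 3 parts, increment the "start" counter; if it occurs in the second part, increment the "middle" counter; and if it occurs in the third part, increment the "end" counter.

  • Print the "start", "middle" and "end" counters, i.e. the sums of the per-sequence "start", "middle" and "end" counts.

    Say the values are '2','5','3' for the first sequence and '4','1','6' for the second; the final counts should then be '6,6,9'.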

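Problems I'm facing:

1. If a particular sequence is split into 3 parts, a potential $pattern can be lost. For example, take the following sequence:

    GGGATGTGATGCATGGGGATGCATCGATGCGGGGACTAGTCAGCGGGATGCTACGATGGATGATTAATATCCGGCGCATATAGATGCTATCATATATATATAGCGGCGCATATAGATGCTATCATATATATATATATATATATAGTATCTATTA

Splitting it into 3 parts would give the following 3 sub-parts, each 35 characters long: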

    gggatgtcgatgcatggggatgcatcgatgcgggg
    ACTAGCTAGCGGGATGCTACGATGGATGAT
    AATATCGCGCATATATGCTAGTCTATATATA

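So $pattern is split across the first two parts. There is no way to say "if $pattern starts in the first part and ends in the second, increment the 'start' counter".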

###### The following problem has been solved, thanks to the code suggested by Cupidvogel

2. If the length of a sequence is not divisible by 3, how do I split it into 3 parts? I tried using int, but the last part came out 1-2 characters short.

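Here is the code I have written so far.

It reads in the file, displays the header name and the sequence and the length each sequence will be split into, and finally splits the sequence into 3 parts. It works well if the sequence length is divisible by 3; for sequences whose length is not divisible by 3, the final third part is 1-2 characters short.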

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Take the file name from the user
    print "Please enter file name : ";
    my $in = <STDIN>;
    chomp $in;

    open(my $fasta, '<', $in) or die "Cannot open '$in': $!";

    # Read one FASTA record per iteration. The separator must be set
    # before the first read, not inside the loop.
    local $/ = ">";

    while (my $record = <$fasta>) {
        chomp $record;                  # strip the trailing ">"
        next unless $record =~ /\S/;    # skip the empty chunk before the first ">"

        my @lines  = split /\n/, $record;
        my $header = shift @lines;      # header of the FASTA sequence
        print "\n\nNext sequence: \n";
        print $header, "\n";

        my $seq = join '', @lines;      # sequence
        $seq =~ s/\s//g;
        $seq =~ s/\*//g;
        print $seq, "\n\n";

        # Two parts of int(length/3) characters; the third takes the rest
        my $num = int(length($seq) / 3);
        my @arr = unpack("A$num A$num A*", $seq);
        print " New method gives this :", join(' ', @arr);
        print "\nThe first element is :",  $arr[0];
        print "\nThe second element is :", $arr[1];
        print "\nThe third element is :",  $arr[2];

        # The following lines were originally written to split
        # the sequence into 3 parts, albeit unsuccessfully
        #my $split = (length $seq)/3;
        #print $split,"\n\n";
        #my $int = int $split;
        #print $int,"\n\n";
        #my @array2 = $seq =~ /(.{$int})/g;
        #print join (" ", @array2),"\n\n";
        #print $array2[0],"\n",$array2[1],"\n",$array2[2];
    }

    close $fasta;
    
The actual input file looks like this:

    >NR_037701 1
    aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
    tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
    aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
    ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
    gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
    agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
    gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
    agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
    gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
    gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
    cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
    aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
    actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
    ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
    tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
    cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
    ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
    gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
    cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
    caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
    gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
    ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
    tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
    tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
    acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
    aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
    catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
    ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
    tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
    gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
    ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
    atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
    gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
    gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
    gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
    gctccctcttttaaagattttccttccctctttccaactccctgggtcct
    ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
    tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
    ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
    agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
    gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
    cacagactcaaaccctctctcacacacatacacatatacattgttattcc
    acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
    ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
    caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
    tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
    agggttgggacttcaacacagctttttgggggatcataattcaacccatg
    acagccactgagattattatatctccagagaataaatgtgtggagttaaa
    aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
    ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
    gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
    tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
    tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
    aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
    ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
    actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
    gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
    gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
    tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
    gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
    ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
    gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
    aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
    ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
    caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
    actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
    tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    >NM_198399 1
    aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
    agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
    caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
    tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
    tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
    attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
    caatagcagaggaaggagggactgagcaggagacggccactccagagaac
    ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
    gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
    caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
    gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
    aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
    gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
    ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
    cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
    tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
    atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
    gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
    actgcatatttttacccttatttttgctccttacagcaagattagtaggt
    tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
    tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
    taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
    tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
    attgattaagatatcattatttttgtttggtttggttttgcttttttcct
    cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
    tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
    tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
    atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
    gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
    cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
    aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
    tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
    aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
    gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
    ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
    gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
    tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
    agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
    gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
    gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
    ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
    agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
    tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
    aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
    cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
    tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
    atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
    ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
    aaaaaaaaaaaaa
    >NR_026816 1
    caacccactctctgtgctatgacttcattactctttcccagcccagccct
    gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
    atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
    aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
    agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
    ccagagccaaggctacagctagagagttgactcctctatttgagattgac
    aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
    tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
    cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
    gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
    tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
    tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
    >NR_027917 1
    atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
    cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
    ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
    ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
    gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
    ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
    tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
    ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
    ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
    cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
    gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
    ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
    gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
    ggtgtag
    >NR_002777 3
    cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
    ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
    gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
    aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
    tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
    aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
    cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
    taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
    ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
    agcctaatagagaacataaaattctaaaagataaagataataataatgat
    aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
    tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
    aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
    gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
    cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
    tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
    tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
    tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
    gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
    ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
    cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
    cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
    aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
    cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
    ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
    actcataggacataaaaaaaaaaaaaa
    >NR_033769 1
    ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
    agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
    agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
    aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
    ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
    atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
    cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
    acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
    ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
    tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
    gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
    tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
    gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
    cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
    tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
    ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
    cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
    ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
    tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
    gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
    tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
    tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
    aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
    aagttaggtttatagagtttgactagttttttcgattagatttgtattag
    ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
    ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
    ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
    taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
    atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
    gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
    tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
    atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
    aaagaaaagaaaaacagagacggtc
    >NM_016326 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
    cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
    aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
    tttgtaattttatattactttttagtttgatactaagtattaaacatatt
    tctgtattcttccacatattttctgcagttattttaactcagtataggag
    ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
    acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
    gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
    atctctttactgcctggctggccggcagctccg
    >NM_181641 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
    ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
    cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
    aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
    aaacatatttctgtattcttccacatattttctgcagttattttaactca
    gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
    actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
    ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
    gatctttgaatctctttactgcctggctggccggcagctccg
    >NM_001144931 1
    gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
    ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
    ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
    ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
    gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
    acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
    gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
    agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
    atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
    ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
    cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
    ttgggaggccgag
    >NR_029429 1
    ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
    ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
    caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
    atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
    actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
    tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
    tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
    caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
    cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
    gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
    cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
    gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
    cttcctgc
    >NR_026551 1
    tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
    taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
    ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
    gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
    gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
    acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
    gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
    cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
    gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
    gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
    agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
    aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
    cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
    gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
    gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
    tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
    cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
    cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
    gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
    gagcagagccctcactcccaggcagagttgtctgaatccttcct
    >NM_181640 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
    taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
    accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
    atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
    ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
    taattttatattactttttagtttgatactaagtattaaacatatttctg
    tattcttccacatattttctgcagttattttaactcagtataggagctag
    aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
    ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
    gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
    ctttactgcctggctggccggcagctccg
    >NM_016951 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
    actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
    ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
    gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
    tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
    gaagttttgtaattttatattactttttagtttgatactaagtattaaac
    atatttctgtattcttccacatattttctgcagttattttaactcagtat
    aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
    ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
    cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
    tttgaatctctttactgcctggctggccggcagctccg
    >NR_002773 1
    cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
    ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
    ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
    ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
    gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
    ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
    gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
    tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
    gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
    ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
    gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
    gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
    tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
    caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
    ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
    acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
    gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
    gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
    cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
    ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
    gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
    ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
    agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
    cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
    agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
    ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
    cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
    tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
    acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
    gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
    aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
    gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
    caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
    atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
    gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
    gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
    ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
    ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
    ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
    gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
    cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
    cactccatcctgcgtgactgaac
    >NR_037806 1
    attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
    ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
    ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
    cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
    gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
    catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
    agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
    ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
    gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
    gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
    gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
    gattaaaatcacactaataacccctggatggtcaatctgataataggatc
    agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
    agattattcccagtctcccagtaacacgtttctacccagatcctttttca
    tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
    aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
    ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
    gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
    tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
    tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
    aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
    ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
    ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
    ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
    tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
    aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
    agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
    agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
    gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
    attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
    atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
    gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
    gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
    ggattagctctg
    
Any help and input would be greatly appreciated.

Thank you for taking the time to look at my problem.

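I think the way to do this is not to split the sequence into three parts, but to look for all occurrences of $pattern in the complete sequence and determine in which of the three parts each occurrence starts.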

The built-in variable $-[0] contains the offset of the start of the most recent successful match.
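
As a small illustration of that variable, the sketch below records the start offset of every run of three to five G's in a toy string. The string and the names `$dna` and `@starts` are made up for this example and are not taken from the input file.

```perl
use strict;
use warnings;

# Hypothetical toy sequence, not from the question's data.
my $dna = 'ATGGGCATGGGGTT';

my @starts;
while ($dna =~ /G{3,5}/g) {
    # $-[0] is the offset where the whole match starts;
    # $+[0] is the offset just past its end.
    push @starts, $-[0];
    printf "matched '%s' at offset %d, ending before offset %d\n",
        substr($dna, $-[0], $+[0] - $-[0]), $-[0], $+[0];
}
# prints:
# matched 'GGG' at offset 2, ending before offset 5
# matched 'GGGG' at offset 8, ending before offset 12
```

Offsets are zero-based, so a match at the very beginning of the string reports 0.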

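The code below does what I need. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.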

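The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round the fractional value to the nearest character position.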
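
To make the rounding concrete, here is a minimal sketch that applies the same expression to three lengths, one divisible by 3 and two that are not. The helper name `third_offsets` is hypothetical and only wraps the expression described above.

```perl
use strict;
use warnings;

# Hypothetical helper: end offset of each third of a string of $length characters.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

for my $length (999, 1000, 1001) {
    my @offsets = third_offsets($length);
    print "$length: @offsets\n";
}
# prints:
# 999: 333 666 999
# 1000: 333 667 1000
# 1001: 334 667 1001
```

Because the fractional part of a multiple of one third can only be 0, 1/3 or 2/3, the value is never exactly halfway between two integers, so the rounding direction is never ambiguous here.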

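For each occurrence of $regex in the sequence, the @counts array is adjusted. The element of @counts to increment is determined by comparing the start position of the match, in $-[0], with the end offset of each of the three segments of the sequence.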
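
The sketch below shows why this handles a pattern that straddles a boundary. It uses a made-up 30-character sequence (an assumption for illustration, not real data) in which the only occurrence starts in the first third and ends in the second, and compares the result with matching each third separately.

```perl
use strict;
use warnings;

my $regex = qr/G{3,5}[A-Z]{1,25}G{3,5}[A-Z]{1,25}G{3,5}[A-Z]{1,25}G{3,5}/i;

# Hypothetical sequence: the pattern occupies offsets 0..14, the rest is padding.
my $sequence = 'GGGAGGGAGGGAGGG' . ('A' x 15);
my @offsets  = (10, 20, 30);    # end offsets of the three thirds

my @counts = (0, 0, 0);
while ($sequence =~ /$regex/g) {
    my $place = $-[0];
    for my $i (0 .. 2) {
        next if $place >= $offsets[$i];    # the match starts after this third
        $counts[$i]++;
        last;
    }
}
print "whole sequence: @counts\n";    # whole sequence: 1 0 0

# For comparison: searching the three pieces separately finds nothing,
# because no single piece contains the complete pattern.
my @pieces     = unpack 'A10 A10 A10', $sequence;
my $split_hits = grep { /$regex/ } @pieces;
print "split pieces: $split_hits\n";  # split pieces: 0
```

The match is attributed to the "start" counter because only its start offset is compared with the segment boundaries; where it ends does not matter.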

After each sequence is processed, the values in @counts are added to @totals to give the overall figures for all sequences.
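
A minimal sketch of that accumulation, using the per-sequence counts from the question's own example (2, 5, 3 and 4, 1, 6) in place of real match counts. The `@per_sequence` array is an assumption made for this illustration.

```perl
use strict;
use warnings;

my @totals = (0, 0, 0);

# Hypothetical per-sequence counts: [start, middle, end]
my @per_sequence = ([2, 5, 3], [4, 1, 6]);

for my $counts (@per_sequence) {
    # Add each of the three counters to the matching running total
    $totals[$_] += $counts->[$_] for 0 .. 2;
}
print "Total: @totals\n";    # Total: 6 6 9
```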

The output of the program when run on the sample data is shown. The totals are (9, 1, 6).


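I lifted Borodin's process_seq function, but used Bio::SeqIO to read the file one sequence at a time, which has advantages over reading it line by line by hand and working out the record-handling logic yourself. I believe those advantages are:

  • The code has been developed and tested by many other people
  • Result files can be created whenever needed, if output is done through the Bio::SeqIO module
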
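      # Line-by-line version: counts matches of $regex by the third of
      # the sequence in which each match starts
      use strict;
      use warnings;

      my $gpat = '[G]{3,5}';
      my $npat = '[A-Z]{1,25}';
      my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
      my $regex = qr/$pattern/i;    # case-insensitive: the sample data is lower case

      open my $fh, '<', 'sequences.txt' or die $!;

      my ($id, $seq);
      my @totals = (0, 0, 0);

      while (<$fh>) {

        chomp;

        if (/^>(\w+)/) {
          # A header line ends the previous sequence, if there was one
          process_seq($seq) if $id;
          $id = $1;
          $seq = '';
          print "$id\n";
        }
        elsif ($id) {
          $seq .= $_;
          # The last sequence ends at end of file rather than at a header
          process_seq($seq) if eof;
        }
      }

      print "Total: @totals\n";

      sub process_seq {

        my $sequence = shift;
        my $length = length $sequence;

        # End offset of each third, rounded to the nearest character
        my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;

        my @counts = (0, 0, 0);

        while ($sequence =~ /$regex/g) {
          my $place = $-[0];    # start offset of this match
          for my $i (0..2) {
            next if $place >= $offsets[$i];
            $counts[$i]++;
            last;
          }
        }

        print "@counts\n\n";
        $totals[$_] += $counts[$_] for 0..2;
      }

Output of this program for the sample data:
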
      NR_037701
      0 0 1
      
      NM_198399
      1 0 0
      
      NR_026816
      1 0 1
      
      NR_027917
      0 0 0
      
      NR_002777
      0 0 0
      
      NR_033769
      1 0 0
      
      NM_016326
      1 0 1
      
      NM_181641
      1 0 1
      
      NM_001144931
      0 0 0
      
      NR_029429
      0 1 0
      
      NR_026551
      1 0 0
      
      NM_181640
      1 0 1
      
      NM_016951
      1 0 1
      
      NR_002773
      1 0 0
      
      NR_037806
      0 0 0
      
      Total: 9 1 6
      
      #!/usr/bin/perl
      # Bio::SeqIO version: same counting logic, but the FASTA file is
      # parsed by BioPerl instead of by hand
      use strict;
      use warnings;
      use Bio::SeqIO;

      my $gpat = '[G]{3,5}';
      my $npat = '[A-Z]{1,25}';
      my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
      my $regex = qr/$pattern/i;

      my $in = Bio::SeqIO->new ( -file   => "fasta_dat.txt",
                                 -format => 'fasta');
      my @totals = (0, 0, 0);
      while ( my $seq = $in->next_seq() ) {
          process($seq);
      }

      print "Totals:   ";
      print "@totals\n";

      sub process {
          my $seq = shift;    # a sequence object returned by next_seq
          # End offset of each third, rounded to the nearest character
          my @offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
          my $sequence = $seq->seq;

          my @count = (0,0,0);
          while ($sequence =~ /$regex/g) {
              my $place = $-[0];    # start offset of this match
              for my $i (0 .. 2) {
                  next if $place >= $offset[$i];
                  $count[$i]++;
                  last;
              }
          }
          print $seq->id, "\n@count\n";
          $totals[$_] += $count[$_] for 0 .. $#count;
      }