Calculating nucleotide frequencies with a Perl script

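Below is a Perl script that counts sequence lengths and how often each length occurs, along with the frequencies of the nucleotides A, T, G and C. The script works on files with a large number of sequences, but for a small file like this one it does not give correct results: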

infile.fasta

count.pl

#!/usr/bin/perl -w

#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34

my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my @lines = ("header1","sequence","header2","quality");
my @lray = ();
my $count = 0;
my $total = 0;
my $i = 0;

my @Aray = ();
my @Cray = ();
my @Gray = ();
my @Tray = ();

my$FN = "";

for ($i=$min_len; $i<=$max_len; $i++){
   $lray[$i] = 0;
}

open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
   while (<INFILE>) {
      $lines[$count] = $_;
      chomp($lines[$count]);
      $count++;
      if($count eq 4){
         $read_len = length($lines[1]); 
#         print "$read_len $lines[1]\n";
         $FN = substr $lines[1], 0, 1;  
         $lray[$read_len]++;
         if ($FN eq "T") { $Tray[$read_len]++;} 
         else {        
            if ($FN eq "A"){ $Aray[$read_len]++;}
            else {
               if ($FN eq "C"){ $Cray[$read_len]++;}
               else {
                 if ($FN eq "G"){ $Gray[$read_len]++;}
               }   
            }
         }           
         $count = 0;
      }
   }
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
   print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;

I would appreciate it if you could help me correct this code. Thanks.

Try not to reinvent the wheel. Using the FAST::Bio::SeqIO module, you get:

use 5.014;
use warnings;
use FAST::Bio::SeqIO;

my $fasta  = FAST::Bio::SeqIO->new(-file => "infile.fasta", -format => 'Fasta');
my $seqnum=0;
while ( my $seq = $fasta->next_seq() ) {
    my $stats;
    $stats->{len} = length($seq->seq);
    $stats->{$_}++ for split //, $seq->seq;
    say ++$seqnum, " @$stats{qw(len A C G T)}";
}

For your demo infile.fasta, the above prints one line per sequence: its number, its length, and its A, C, G and T counts.

Or, aggregated by sequence length:

use 5.014;
use warnings;
use FAST::Bio::SeqIO;

my $fasta  = FAST::Bio::SeqIO->new(-file => "file.fasta", -format => 'Fasta');
my $stats;
while ( my $seq = $fasta->next_seq() ) {
    my $len = length($seq->seq);
    $stats->{$len}{count}++;
    $stats->{$len}{$_}++ for split //, $seq->seq;
}
say "Length $_ ($stats->{$_}->{count} times) Letters freq: @{$stats->{$_}}{qw(A C G T)}" for sort { $a <=> $b }  keys %$stats;

And so on…

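You should start every Perl file with use strict; use warnings;. Don't use -w; it has been obsolete since 2000.

@melpomene Thanks! I followed your advice and still don't get the correct result.

@melpomene "Use of uninitialized value in concatenation (.) or string at count.pl" is the error message.

Thank you very much. As shown in my output above, how do you count the number of sequences of the same length and the total frequency of the 4 nucleotides for each sequence length?

@MAPK I don't know what output you got from some mysterious large file with many sequences. I can only work with the data you provided, and I also provided some demo code for using a module and counting letters. I know nothing about nucleotides and so on...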
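The "uninitialized value" warning quoted in the comments most likely comes from the final print loop in count.pl: @Aray, @Cray, @Gray and @Tray only get entries for lengths that were actually seen, so every other length in the min..max range is undef. Here is a minimal sketch of one way to avoid that, using the defined-or operator (Perl 5.10 or later). The helper name format_row and the sample counts are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator and say

# Hypothetical helper: build one tab-separated output row for a given length,
# substituting 0 for any counter that was never incremented.
sub format_row {
    my ($len, @counters) = @_;    # each counter is an array ref indexed by length
    return join "\t", $len, map { $_->[$len] // 0 } @counters;
}

# Made-up counts: only length 20 has been seen.
my (@lray, @Aray, @Cray, @Gray, @Tray);
$lray[20] = 2;
$Aray[20] = 1;
$Tray[20] = 1;

say "length\tnumber\tA\tC\tG\tT";
say format_row($_, \@lray, \@Aray, \@Cray, \@Gray, \@Tray) for 19 .. 21;
```

Lengths 19 and 21 then print as rows of zeros instead of triggering warnings. Initializing all five arrays to 0 in the first loop of count.pl would work just as well.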

Output of the first script for the demo infile.fasta (sequence number, length, then the A, C, G and T counts):

1 20 1 5 5 9
2 19 6 4 6 3
3 22 4 2 7 9
4 21 3 5 4 9
5 23 5 7 5 6
6 22 6 8 3 5
7 22 7 7 3 5
8 22 10 8 3 1
9 21 9 5 2 5
10 22 10 8 3 1
11 21 1 9 2 9
12 24 8 3 8 5
13 20 4 4 5 7

Output of the second script, grouped by sequence length:

Length 19 (1 times) Letters freq: 6 4 6 3
Length 20 (2 times) Letters freq: 5 9 10 16
Length 21 (3 times) Letters freq: 13 19 8 23
Length 22 (5 times) Letters freq: 37 33 19 21
Length 23 (1 times) Letters freq: 5 7 5 6
Length 24 (1 times) Letters freq: 8 3 8 5
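If installing the FAST module is not an option, the same per-length aggregation can be sketched in core Perl. Note that count.pl groups input lines in fours (header, sequence, second header, quality), which matches the FASTQ layout rather than FASTA, and it only looks at the first base of each read; that may be why it misbehaves on a plain FASTA file. The sketch below assumes a FASTA layout (a ">" header line followed by one or more sequence lines). The name stats_by_length and the demo records are invented for illustration.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: read FASTA records from a filehandle and return a hash ref
# mapping sequence length => { count => N, A => ..., C => ..., G => ..., T => ... }.
sub stats_by_length {
    my ($fh) = @_;
    my (%stats, $seq);

    # Fold one finished sequence into the per-length totals.
    my $flush = sub {
        return unless defined $seq && length $seq;
        my $bucket = $stats{ length $seq } //= { count => 0 };
        $bucket->{count}++;
        $bucket->{$_}++ for split //, uc $seq;
        undef $seq;
    };

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;              # strip the newline and any trailing \r
        next unless length $line;
        if ($line =~ /^>/) {             # a header line starts a new record
            $flush->();
            $seq = '';
        }
        elsif (defined $seq) {           # a sequence may span several lines
            $seq .= $line;
        }
    }
    $flush->();                          # account for the last record
    return \%stats;
}

# Made-up demo data, read from an in-memory filehandle instead of a file.
my $demo = ">r1\nACGT\n>r2\nTTGA\n>r3\nAC\nGTA\n";
open my $fh, '<', \$demo or die "cannot open in-memory file: $!";
my $stats = stats_by_length($fh);
close $fh;

say join "\t", qw(length number A C G T);
for my $len (sort { $a <=> $b } keys %$stats) {
    say join "\t", $len, map { $stats->{$len}{$_} // 0 } qw(count A C G T);
}
```

For the three demo records this prints a header row followed by two rows: length 4 with 2 sequences (A=2, C=1, G=2, T=3) and length 5 with 1 sequence (A=2, C=1, G=1, T=1). To run it on a real file, open that file for reading and pass its filehandle to stats_by_length instead of the in-memory one.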