Perl: Regex - matching values with the alphabet


I wrote a small Perl "hack" that replaces 1s with letters across a range of columns in a tab-delimited file. The file looks like this:

Chr Start   End Name    Score   Strand  Donor   Acceptor    Merged_Transcript   Gencode Colon   Heart   Kidney  Liver   Lung    Stomach
chr10   100177483   100177931   .   .   -   1   1   1   1   1   0   1   1   0   0
chr10   100178014   100179801   .   .   -   1   1   1   1   1   1   1   1   1   0
chr10   100179915   100182125   .   .   -   1   1   1   1   1   1   1   0   1   0
chr10   100182270   100183359   .   .   -   1   1   1   1   0   0   1   0   1   0
chr10   100183644   100184069   .   .   -   1   1   1   1   0   0   1   0   1   0
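The goal is to take columns 11 through 16 and, wherever the value in one of those columns is 1, append the matching letter from A to Z. So far my code produces empty output, and this is my first time using regular expressions.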

cat infile.txt \
| perl -ne '@alphabet=("A".."Z");
            $is_known_intron = 0;
            $is_known_donor = 1;
            $is_known_acceptor = 1;
            chomp;
            $_ =~ s/^\s+//;
            @d = split /\s+/, $_;
            @d_bool=@d[$11-$16];
            $ct=1;
            $known_intron = $d[$10];
            $num_of_overlapping_gene = $d[$9];
            $known_acceptor = $d[$8];
            $known_donor = $d[$7];
            $k="";
            if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
               for ($i = 0; $i < scalar @d_bool; $i++){
                   $k.=$alphabet[$i] if ($d_bool[$i])
                }
                $alphabet_ct{$k}+=$ct;
            }
            END
            {
               foreach $k (sort keys %alphabet_ct){
                   print join("\t", $k, $alphabet_ct{$k}), "\n";
               }
            } '\
   > Outfile.txt

And so on.

To make debugging easier, I converted your code into a script. I added comments in the code pointing out the places that look unreliable:

use strict;
use warnings;

my %alphabet_ct;
my @alphabet = ( "A" .. "Z" );

my $is_known_intron   = 0;
my $is_known_donor    = 1;
my $is_known_acceptor = 1;

while (<DATA>) {
    # don't process the first line
    next unless /chr10/;
    chomp;
    # this should remove whitespace at the beginning of the line but is doing nothing as there is none
    $_ =~ s/^\s+//;

    my @d = split /\s+/, $_;
    # the range operator in perl is .. (not "-")
    my @d_bool         = @d[ 10 .. 15 ];
    my $known_intron   = $d[9];
    my $known_acceptor = $d[7];
    my $known_donor    = $d[6];
    my $k              = "";
    # this expression is false for all the data in the sample you provided as
    # $is_known_intron is set to 0
    if (    ( $known_intron   == $is_known_intron )
        and ( $known_donor    == $is_known_donor )
        and ( $known_acceptor == $is_known_acceptor ) )
    {
        for ( my $i = 0; $i < scalar @d_bool; $i++ ) {
            $k .= $alphabet[$i] if $d_bool[$i];
        }
        # it is more idiomatic to write $alphabet_ct{$k}++;
        # $alphabet_ct{$k} += $ct;
        $alphabet_ct{$k}++;
    }
}
foreach my $k ( sort keys %alphabet_ct ) {
    print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}

__DATA__
Chr Start   End Name    Score   Strand  Donor   Acceptor    Merged_Transcript   Gencode Colon   Heart   Kidney  Liver   Lung    Stomach
chr10   100177483   100177931   .   .   -   1   1   1   1   1   0   1   1   0   0
chr10   100178014   100179801   .   .   -   1   1   1   1   1   1   1   1   1   0
chr10   100179915   100182125   .   .   -   1   1   1   1   1   1   1   0   1   0
chr10   100182270   100183359   .   .   -   1   1   1   1   0   0   1   0   1   0
chr10   100183644   100184069   .   .   -   1   1   1   1   0   0   1   0   1   0
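The letter-building step can also be pulled out into a small function. This is only a sketch under a few assumptions: the names `flags_to_key` and `count_keys` are made up for illustration, the header is skipped by checking that the Start field is numeric (instead of matching `chr10`), and the donor/acceptor/intron filter is left out so that the mapping itself is easy to see.

```perl
use strict;
use warnings;

# Map a list of 0/1 flags to letters: first flag -> A, second -> B, ...
# (hypothetical helper, not part of the original script)
sub flags_to_key {
    my @flags    = @_;
    my @alphabet = ( 'A' .. 'Z' );
    return join '', map { $flags[$_] ? $alphabet[$_] : () } 0 .. $#flags;
}

# Count letter keys over whitespace-separated rows, using columns 11 to 16
# (array indices 10 .. 15, because Perl arrays are 0-indexed).
sub count_keys {
    my @rows = @_;
    my %count;
    for my $row (@rows) {
        my @d = split ' ', $row;
        # skip the header and any short line: Start must be a number
        next unless @d >= 16 && $d[1] =~ /^\d+$/;
        $count{ flags_to_key( @d[ 10 .. 15 ] ) }++;
    }
    return %count;
}

my @rows = (
    "Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach",
    "chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0",
    "chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0",
    "chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0",
    "chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0",
    "chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0",
);

my %count = count_keys(@rows);
print join( "\t", $_, $count{$_} ), "\n" for sort keys %count;
```

For the five sample rows this prints four keys, with `CE` counted twice because the last two rows carry the same flags. Putting the three `==` tests back in is a matter of adding them to the `next unless` condition.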

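Have you tried putting debugging statements in your code so you can see what it is doing at each step? Your code has a lot of problems. First of all, you should add use strict; use warnings; to the top of the script to give you clues about what is going wrong. Are you aware that arrays are 0-indexed in Perl? And that all the references to $7, $8, $9 and so on return no value?

Does the expected output you posted match the input? Also, whatever formatting it had has been lost; you need to indent it, or remove the newlines and spacing.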
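To make the indexing point concrete, here is a minimal sketch of what the expressions in the one-liner evaluate to. It assumes no regex with capture groups has matched, which is the case in the posted code; the sub name `capture_index` is made up for the demonstration, and warnings are silenced inside it only so that the undefined values can be shown.

```perl
use strict;
use warnings;

# First sample row, split into fields as in the question.
my @d = split ' ', "chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0";

# $11 and $16 are regex capture variables, not column numbers. With no
# successful match they are undef, so $11-$16 is a subtraction giving 0.
sub capture_index {
    no warnings 'uninitialized';    # silence "uninitialized value" here only
    return $11 - $16;
}

my $idx = capture_index();
print "index from \$11-\$16: $idx\n";                                # 0
print "field selected: $d[$idx]\n";                                  # chr10
print "columns 11 to 16: ", join( ' ', @d[ 10 .. 15 ] ), "\n";       # 1 0 1 1 0 0
print "Donor, Acceptor, Gencode: ", join( ' ', @d[ 6, 7, 9 ] ), "\n"; # 1 1 1
```

So `@d[$11-$16]` is the one-element slice `@d[0]`, and `$d[$7]`, `$d[$8]`, `$d[$10]` all read `$d[0]`, the chromosome name. The donor test then compares `chr10` with 1 and fails for every row, which is consistent with the empty output.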
Expected output for the sample rows:

ABCDE	1
ABCE	1
ACD	1
CE	2