Write a Perl script that takes a FASTA file and reverses all the sequences (without BioPerl)?

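I don't know whether this is just a quirk of Strawberry Perl, but I can't seem to get it to run. I just need to do a quick test and reverse every sequence in the file.

-The problem-

I have a multi-FASTA file: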

>seq1
ABCDEFG
>seq2
HIJKLMN

The expected output is:

>REVseq1
GFEDCBA
>REVseq2
NMLKJIH

The script is as follows:

$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my @proteins = split (/>/, $text); ## put all input sequences into an array


for my $protein (@proteins) { ## evaluate each input sequence individually
    $protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
    ## the FATA-formatted protein
    my $name = $1; ## remember the name of the input sequence
    print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
    ## protein stand out in a list
    $protein =~ s/\n//gm; ## remove newline characters from sequence
    $protein = reverse($protein); ## reverse the sequence

    while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols

    $protein =~ s/(.{$NUM_C0L})//;
    my $line = $1;
    print $OUT "$line\n";
    }
    print $OUT "$protein\n"; ## print last portion of reversed protein
}

close ($IN);
close ($OUT);
print "done\n";

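This will do as you ask.

It builds a hash %fasta from the FASTA file, keeps an array @keys to preserve the order of the sequences, and then prints out each element of the hash.

Each line of a sequence is reversed using reverse before it is added to the hash, and unshift is used so that the lines of the sequence are stored in reverse order.

The program expects the input file as a parameter on the command line and prints the result to STDOUT, which can be redirected on the command line.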

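use strict;
use warnings 'all';

my (%fasta, @keys);

{
    my $key;

    while ( <> ) {

        chomp;

        # A header line: insert REV after the leading '>' and start a new record
        if ( s/^>\K/REV/ ) {
            $key = $_;
            push @keys, $key;
        }
        # A sequence line: reverse its characters and put it in front of the earlier lines
        elsif ( $key ) {
            unshift @{ $fasta{$key} }, scalar reverse;
        }
    }
}

for my $key ( @keys ) {
    print $key, "\n";
    print "$_\n" for @{ $fasta{$key} };
}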

Output

>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
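
The approach depends on reverse behaving differently in scalar and list context, which is easy to overlook. The following is only a small illustrative sketch: the sample lines and the @stack array are made up and are not part of the original program. In list context reverse reverses the order of a list, while in scalar context it reverses the characters of a string, and with no arguments it works on $_.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the original post
my @lines = ('ABC', 'DEF');
my @stack;

for (@lines) {
    # Scalar context, no arguments: reverses the characters of $_
    unshift @stack, scalar reverse;
}

print "$_\n" for @stack;       # prints FED, then CBA

# List context: a one-element list reversed is the same list,
# so the string itself is left unchanged
print reverse('ABC'), "\n";    # prints ABC
```

Because each reversed line is added with unshift rather than push, the last line of the input ends up first, which is what makes the whole multi-line sequence read backwards.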

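Update: If you would prefer to rewrap the sequence so that the short lines are at the end, then you just need to rewrite the code that dumps the hash.

This alternative uses the length of the longest line in the original file as the limit, and rewraps the reversed sequence to that same length. Clearly it would be simple to specify an explicit length instead of calculating it.

You will need to add this line at the top of the program

use List::Util 'max';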

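# Length of the longest sequence line in the input
my $len = max map length, map @$_, values %fasta;

for my $key ( @keys ) {
    print $key, "\n";
    # Join the reversed lines and rewrap them to at most $len characters
    my $seq = join '', @{ $fasta{$key} };
    print "$_\n" for $seq =~ /.{1,$len}/g;
}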

Given the original data, the output is identical to that of the solution above. I used this as input

>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY

with this result. All lines have been wrapped to eleven characters, the length of JKLMNOPQRST, the longest line in the original data

>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
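
If a fixed line width is preferred over the computed one, the dumping loop could take an explicit value instead. This is a minimal sketch under stated assumptions: $width is a hypothetical variable (set to 10 here only to keep the example short), and the hard-coded %fasta and @keys stand in for the structures that the main program builds from the input file.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the data built by the main program
my @keys  = ('>REVseq1');
my %fasta = ('>REVseq1' => ['ZYXWVU', 'TSRQPONMLKJ', 'IHGFEDCBA']);

my $width = 10;    # assumed explicit line width; any positive integer works

my @out;
for my $key (@keys) {
    push @out, $key;
    my $seq = join '', @{ $fasta{$key} };
    # Split the joined sequence into chunks of at most $width characters
    push @out, $seq =~ /.{1,$width}/g;
}

print "$_\n" for @out;
```

With these sample values the 26-character sequence is printed as two lines of ten characters followed by one line of six. With an explicit width the List::Util import is no longer needed.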