Perl: How to convert PHYLIP format to FASTA
I'm just starting out with Perl and I have a question. I have a PHYLIP file that I need to convert to FASTA. I've started writing a script. First I removed the whitespace from the lines; now I need to reflow all the lines so that each one holds 60 amino acids, and each sequence identifier should be printed on its own line. Maybe someone can give me some advice?
The BioPerl module may help. It supports the PHYLIP sequence format:
phylip2fasta.pl
use strict;
use warnings;
use Bio::AlignIO;
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format
my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;
my $in = Bio::AlignIO->new(-file        => $inputfilename,
                           -format      => 'phylip',
                           -interleaved => 1);
my $out = Bio::AlignIO->new(-fh     => \*STDOUT,
                            -format => 'fasta');
while ( my $aln = $in->next_aln() ) {
    $out->write_aln($aln);
}
$ perl phylip2fasta.pl test.phylip
>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
test.phylip
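If you have access to BioPerl, I'd recommend using it (see the other answer). If not, here is a quick script I used a few years ago for an old homework assignment. It may be useful to you.
One thing to note: it prints each FASTA sequence in full on a single line, so you will have to edit the final print statement to print 70 AA per line.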
#!/usr/bin/perl
use warnings;
use strict;
<DATA> =~ /(\d+)/; # first number is number of species
my $num_species = $1;
my $i = 0;
my @species;
my @acids;
# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {
    my @line = split /\s+/, <DATA>;
    chomp @line;
    push @species, shift (@line);
    push @acids, join ("", @line);
}
# Get the rest of the AAs
$i = 0;
while (<DATA>) {
    chomp;
    $_ =~ s/\r//g; # remove \r
    next if !$_;
    $_ =~ s/\s+//g; # remove spaces
    $acids[$i] .= $_;
    $i = ($i + 1) % $num_species; # move on to the next species, wrapping around
}
# Print them
for ($i = 0; $i < $num_species; $i++) {
    print "> ", $species[$i], "\n";
    # removes the gaps ("-"); comment out the next line to keep them
    $acids[$i] =~ s/-//g;
    print $acids[$i], "\n\n";
}
# Simple PHYLIP Amino Acid file
__DATA__
10 234
Cow MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL
THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI
GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC
GSNHSFMPIV LELVPLKYFE KWSASML--- ----
GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
GANHSFMPIV LELIPLKIFE M-------GP VFTL
GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--
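The note before the script says the final print statement has to be edited to break each sequence into fixed-width lines. Here is a minimal sketch of one way to do that. The helper name wrap_sequence is hypothetical (it is not part of the original script), and the default width of 60 is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence string into lines of at most
# $width characters (60 by default) and return them joined by newlines.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    # "(A$width)*" cuts the string into consecutive chunks of $width
    # characters; the last chunk may be shorter.
    return join "\n", unpack("(A$width)*", $seq);
}

# Example with a made-up 25-residue sequence and a width of 10.
print wrap_sequence("MAYPMQLGFQDATSPIMEELLHFHD", 10), "\n";
```

In the script above, the last print could then become: print wrap_sequence($acids[$i], 60), "\n\n";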
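I think the doc.bioperl.org site needs some work. Compare with …. I guess that's to be expected, since the code that generates it hasn't been updated since.
Thanks for your answer, but how can I do this without BioPerl?
Just out of curiosity, why do you need to convert to FASTA? Is it aligned FASTA that you need?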
Output of the script:
> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML
> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS
> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS
> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL
> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS
> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI
> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML
> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML
> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
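The output above has one long line per sequence, while the question asks for 60 residues per line, and real input would normally come from a file rather than from __DATA__. The following is a rough sketch of both changes without BioPerl, not the original author's code. The function name phylip_to_fasta is hypothetical. It assumes interleaved PHYLIP in which names contain no spaces, and it keeps gap characters.

```perl
use strict;
use warnings;

# Hypothetical helper: read interleaved PHYLIP from a filehandle and
# return FASTA text with sequence lines wrapped at $width characters.
sub phylip_to_fasta {
    my ($fh, $width) = @_;
    $width ||= 60;

    # Header line: number of sequences, then alignment length.
    my $header = <$fh>;
    my ($num_species) = defined $header ? $header =~ /(\d+)/ : ();
    die "missing PHYLIP header\n" unless $num_species;

    my (@names, @seqs);
    my $i = 0;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;      # trim, including \r\n
        next if $line eq '';          # skip blank lines between blocks
        if (@names < $num_species) {
            # First block: name, then sequence chunks.
            my ($name, @chunks) = split /\s+/, $line;
            push @names, $name;
            push @seqs, join('', @chunks);
        }
        else {
            # Later blocks: sequence only, in the same order as the names.
            $line =~ s/\s+//g;
            $seqs[$i] .= $line;
            $i = ($i + 1) % $num_species;
        }
    }

    my $fasta = '';
    for my $n (0 .. $#names) {
        my @lines = unpack("(A$width)*", $seqs[$n]);
        $fasta .= join("\n", ">$names[$n]", @lines) . "\n";
    }
    return $fasta;
}

# Demo on a tiny made-up alignment held in memory.
my $demo = "2 12\nAlpha ACGTAC\nBeta  ACGTTC\n\nGGGTTT\nGGCTTT\n";
open my $fh, '<', \$demo or die "cannot open in-memory file: $!";
print phylip_to_fasta($fh, 10);
close $fh;
```

To run it on a real file, open the filehandle on the file name instead of on the in-memory string.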