Perl: find a text sequence and create new files with replacement text

I am trying to find a way to write a script that does the following:

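Open the input file and detect the first occurrence of a repeated three-letter sequence

Edit that three-letter sequence 19 times, producing 19 outputs, each with a different three-letter code taken from the list of the 19 other possible codes

Essentially this is a fairly simple find-and-replace problem, which I know how to do. The problem is that I need to loop the process so that, once the 19 files have been created for one line, the same substitution is carried out for the next line that has a different three-letter code.

I'm struggling to find a way to have the script recognize the text sequence, given that it could be any one of twenty different things.

Let me know if anyone has ideas on how I could do this; I'm happy to provide any clarification needed.

Here is an example of the input file: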

ATOM      1  N   SER A   2      37.396  -5.247  -4.830  1.00 65.06           N  
ATOM      2  CA  SER A   2      37.881  -6.354  -3.929  1.00 64.88           C  
ATOM      3  C   SER A   2      36.918  -7.555  -3.786  1.00 64.14           C  
ATOM      4  O   SER A   2      37.287  -8.576  -3.177  1.00 64.31           O  
ATOM      5  CB  SER A   2      38.251  -5.804  -2.552  1.00 65.31           C  
ATOM      6  OG  SER A   2      37.122  -5.210  -1.918  1.00 66.94           O  
ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N  
ATOM      8  CA  GLU A   3      34.716  -8.539  -4.306  1.00 61.94           C  
ATOM      9  C   GLU A   3      35.126  -9.833  -5.033  1.00 59.71           C  
ATOM     10  O   GLU A   3      34.927 -10.911  -4.473  1.00 59.23           O  
ATOM     11  CB  GLU A   3      33.328  -8.094  -4.789  1.00 62.49           C  
ATOM     12  CG  GLU A   3      32.291  -7.994  -3.693  1.00 66.67           C  
ATOM     13  CD  GLU A   3      31.552  -9.302  -3.426  1.00 71.93           C  
ATOM     14  OE1 GLU A   3      32.177 -10.254  -2.892  1.00 73.96           O  
ATOM     15  OE2 GLU A   3      30.329  -9.364  -3.723  1.00 74.25           O  
ATOM     16  N   PRO A   4      35.663  -9.732  -6.280  1.00 57.83           N  
ATOM     17  CA  PRO A   4      36.131 -10.951  -6.967  1.00 56.64           C

where the output looks like this:

ATOM      1  N   ALA A   2      37.396  -5.247  -4.830  1.00 65.06           N  
ATOM      2  CA  SER A   2      37.881  -6.354  -3.929  1.00 64.88           C  
ATOM      3  C   SER A   2      36.918  -7.555  -3.786  1.00 64.14           C  
ATOM      4  O   SER A   2      37.287  -8.576  -3.177  1.00 64.31           O  
ATOM      5  CB  SER A   2      38.251  -5.804  -2.552  1.00 65.31           C  
ATOM      6  OG  SER A   2      37.122  -5.210  -1.918  1.00 66.94           O  
ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N  
ATOM      8  CA  GLU A   3      34.716  -8.539  -4.306  1.00 61.94           C  
ATOM      9  C   GLU A   3      35.126  -9.833  -5.033  1.00 59.71           C  
ATOM     10  O   GLU A   3      34.927 -10.911  -4.473  1.00 59.23           O  
ATOM     11  CB  GLU A   3      33.328  -8.094  -4.789  1.00 62.49           C          
ATOM     12  CG  GLU A   3      32.291  -7.994  -3.693  1.00 66.67           C  
ATOM     13  CD  GLU A   3      31.552  -9.302  -3.426  1.00 71.93           C  
ATOM     14  OE1 GLU A   3      32.177 -10.254  -2.892  1.00 73.96           O  
ATOM     15  OE2 GLU A   3      30.329  -9.364  -3.723  1.00 74.25           O  
ATOM     16  N   PRO A   4      35.663  -9.732  -6.280  1.00 57.83           N  
ATOM     17  CA  PRO A   4      36.131 -10.951  -6.967  1.00 56.64           C

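In the first pass, SER should be changed to 20 different text sequences, the first being ALA. The problem I'm having is that I'm not sure how to write a script that changes multiple lines of text.

My current script can produce the 19 mutations of the first SER, but that is where it stops. It does not mutate the next residue, nor a different three-letter code; for example, it won't change GLU. Is there a simple way to integrate this functionality?

At the moment my approach uses sed for a simple text conversion, but since this seems more complex than what sed offers, I think Perl might be the way to go. I can add the sed code, but I don't think it would help much.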

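Your question and comments aren't entirely clear, but I believe this script does what you want. It parses the PDB file until it reaches the desired amino acid, then generates a set of 19 files in which that AA is replaced by each of the other 19 AAs. From then on, every time the AA differs from the AA on the previous line, another set of 19 files is generated.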

#!/usr/bin/perl
use warnings;
use strict;

# we're going to start mutating when we find this residue.
my $target = 'GLU';

my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );

my $prev = '';
my $line_no = 0;
my @lines;
my %changes;

# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file

# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
    # split the line into columns on any run of whitespace
    # (this also handles tab-delimited input).
    my @cols = split ' ';

    if ($target && $cols[3] eq $target) {
        # Found our target residue! unset $target so that the following
        # set of tests are performed
        undef $target;
    }

    # see if this AA is the same as the AA in the previous line
    if (! $target && $prev ne $cols[3]) {
        # if it isn't, store the line number and the amino acid
        $changes{ $line_no } = $cols[3];
        # update $prev to reflect the new AA
        $prev = $cols[3];
    }
    # store all the lines
    push @lines, $_;
    # increment the line number
    $line_no++;
}

# now, for each of the changes, create substitute files
for (keys %changes) {
    create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}

sub create_substitutes {
    # arguments: line no, $res: residue, $aas: array of amino acids,
    # $all_lines: all lines in the file
    my ($line_no, $res, $aas, $all_lines) = @_;

    # for each AA in the list of AAs, create a new file called 'XXX-##.txt',
    # where XXX is the amino acid and ## is the line number where the
    # substituted residue is.
    for (@$aas) {
        next if $_ eq $res;
        open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
        # copy the target line and swap its fourth whitespace-separated
        # field (the residue name) for the new AA, keeping the spacing intact
        (my $changed = $all_lines->[$line_no]) =~ s/^((?:\S+\s+){3})\S+/$1$_/;
        # print out all lines up to the changed line
        print { $fh } @$all_lines[0..$line_no-1];
        # print out the changed line
        print { $fh } $changed;
        # print out the rest of the lines.
        print { $fh } @$all_lines[$line_no+1 .. $#{$all_lines}];
    }
}


__DATA__
ATOM    1   N   SER A   2   37.396  -5.247  -4.830  1.00    65.06   N
ATOM    2   CA  SER A   2   37.881  -6.354  -3.929  1.00    64.88   C
ATOM    3   C   SER A   2   36.918  -7.555  -3.786  1.00    64.14   C
ATOM    4   O   SER A   2   37.287  -8.576  -3.177  1.00    64.31   O
ATOM    5   CB  SER A   2   38.251  -5.804  -2.552  1.00    65.31   C
ATOM    6   OG  SER A   2   37.122  -5.210  -1.918  1.00    66.94   O
ATOM    7   N   GLU A   3   35.705  -7.438  -4.342  1.00    62.82   N
ATOM    8   CA  GLU A   3   34.716  -8.539  -4.306  1.00    61.94   C
ATOM    9   C   GLU A   3   35.126  -9.833  -5.033  1.00    59.71   C
ATOM    10  O   GLU A   3   34.927  -10.911 -4.473  1.00    59.23   O
ATOM    11  CB  GLU A   3   33.328  -8.094  -4.789  1.00    62.49   C
ATOM    12  CG  GLU A   3   32.291  -7.994  -3.693  1.00    66.67   C
ATOM    13  CD  GLU A   3   31.552  -9.302  -3.426  1.00    71.93   C
ATOM    14  OE1 GLU A   3   32.177  -10.254 -2.892  1.00    73.96   O
ATOM    15  OE2 GLU A   3   30.329  -9.364  -3.723  1.00    74.25   O
ATOM    16  N   PRO A   4   35.663  -9.732  -6.280  1.00    57.83   N
ATOM    17  CA  PRO A   4   36.131  -10.951 -6.967  1.00    56.64   C
ATOM    18  CA  ARG A   4   36.131  -10.951 -6.967  1.00    56.64   C

and so on.

If this isn't the behavior you're after, you'll need to edit your question, because it isn't very clear.
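
As an alternative to splitting on a delimiter, the residue name can be read by position. This is a minimal sketch, assuming the input follows the standard fixed-width PDB layout, where the residue name sits in columns 18-20 and the chain and residue number in columns 22-26. The helper names (`residue_of`, `residue_starts`, `mutate_line`) are made up for illustration. Unlike the script above, it compares residue name, chain and residue number together, so two neighbouring residues of the same type are still treated as separate.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Residue name: columns 18-20 of an ATOM record (offset 17, length 3).
sub residue_of {
    my ($line) = @_;
    return substr($line, 17, 3);
}

# Indices of the first ATOM line of each residue. The key covers
# columns 18-26: residue name, chain ID and residue sequence number.
sub residue_starts {
    my (@lines) = @_;
    my $prev = '';
    my @starts;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /^ATOM/;
        my $key = substr($lines[$i], 17, 9);
        push @starts, $i if $key ne $prev;
        $prev = $key;
    }
    return @starts;
}

# Copy of the line with the residue name replaced; other columns untouched.
sub mutate_line {
    my ($line, $aa) = @_;
    substr($line, 17, 3) = $aa;
    return $line;
}

my @pdb = (
    "ATOM      1  N   SER A   2      37.396  -5.247  -4.830  1.00 65.06           N\n",
    "ATOM      2  CA  SER A   2      37.881  -6.354  -3.929  1.00 64.88           C\n",
    "ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N\n",
);

# Print the first line of each residue with ALA substituted in.
print mutate_line($pdb[$_], 'ALA') for residue_starts(@pdb);
```

Because the substitution is positional, the column alignment of the original file is preserved, which matters for tools that read PDB files by column.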

Since your question isn't very clear, I created the following:

#!/usr/bin/env perl

use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;

my $residues_file = "input2.txt";   #residue names, one per line
my $molfile = "m1.pdb";             #molecule file

#read the residues
my(@residues) = path($residues_file)->lines({chomp => 1});

my $m= Bio::PDB::Structure::Molecule->new;

for my $res (@residues) {       #for each residue name from the file "input2.txt"
    $m->read($molfile);         #read the molecule
    my $atom = $m->atom(0);     #get the 1st atom
    $atom->residue_name($res);  #change the residue to the one from the file

    #create output filename
    my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
    #write the result
    $m->print($outfile);
}

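For example, if input2.txt contains the 20 residue names, one per line, then for your input 20 files will be generated, in which the residue of the first atom is changed as in your output example, as follows: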

==> m1_ala.pdb <==
ATOM      1  N   ALA A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_arg.pdb <==
ATOM      1  N   ARG A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_asn.pdb <==
ATOM      1  N   ASN A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_asp.pdb <==
ATOM      1  N   ASP A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_cys.pdb <==
ATOM      1  N   CYS A   2      37.396  -5.247  -4.830  1.00 65.06

... and so on, 20 times.
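
The script above writes one file per name in the list, including the residue that is already present, whereas the question asks for the 19 alternatives only. The sketch below shows one way to skip the original residue when building the output names. It assumes the same naming scheme as the script (`m1_ala.pdb` and so on); `variant_names` is a hypothetical helper, and it uses the core `File::Basename` module rather than `Path::Tiny` so it runs without extra installs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Basename qw(basename);

# Build the output file names for every residue in the list
# except the one that is already present in the molecule file.
sub variant_names {
    my ($molfile, $original, @residues) = @_;
    my $stem = basename($molfile, '.pdb');   # 'm1.pdb' -> 'm1'
    return map  { $stem . '_' . lc($_) . '.pdb' }
           grep { uc($_) ne uc($original) } @residues;
}

# SER is the original residue, so it is left out of the result.
my @names = variant_names('m1.pdb', 'SER', qw(ALA ARG SER));
print "$_\n" for @names;   # prints m1_ala.pdb and m1_arg.pdb
```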

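Hi George, without a concrete example it's hard to understand exactly what you mean. Please edit your question to include some sample input and the corresponding output you want to generate. It would also help if you included a snippet of the code you are using, so we can see exactly what problem you're running into.

Please format your text.

I've added a concrete input and output example; hopefully that makes what I'm trying to say easier to understand.

I'm not a biochemist, but these look like PDB files. You might consider looking at some of the existing modules for parsing PDB files; there may be others besides the ones I found in a quick search. They may have built-in functionality that helps you solve the problem more easily than writing your own parsing logic.

Please show your code.

Thank you, I really appreciate it. I'm sorry for being unclear; I was struggling to describe the behavior. This is very useful, far more than I expected. Thanks again.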
