Perl program to find k-mers with a specific sequence

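I'm trying to enhance a Perl program I wrote earlier so that it identifies the first 1000 k-mers of length 23 that end in GG, and prints only those k-mers that occur exactly once in the sequence. However, no matter where I add the regular expression, I can't get the expected result.

My code is as follows: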

#!/usr/bin/perl
use strict;
use warnings;

my $k           = 23;
my $input       = 'Fasta.fasta';
my $output      = 'Fasta2.fasta';
my $match_count = 0;

#Open File
unless ( open( FASTA, "<", $input ) ) {
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/ = ">";

#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
    die "Unable to open file", $!;
}

<FASTA>;    # discard 'first' 'empty' record

my %seen;
while ( my $line = <FASTA> ) {
    chomp $line;
    my ( $header, @seq ) = split( /\n/, $line );
    my $sequence = join '', @seq;

    for ( length($sequence) >= $k ) {
        $sequence =~ m/([ACTG]{21}[G]{2})/g;

        for my $i ( 0 .. length($sequence) - $k ) {
            my $kmer = substr( $sequence, $i, $k );

            ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
            $match_count = $match_count + 1;
            print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
        }
    }
}
(and so on)

Expected result (print the 23-mers that occur only once in the sequence and end in GG). I expect to get:

>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on
With the code above, I instead get a FASTA file like this:

>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on
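Whereas if I remove

for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;
and add while ($kmer =~ m/([ACTG]{21}[G]{2})/g){

I get a FASTA file in which the results are numbered incorrectly and repeats can't be told apart from unique sequences.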

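I've tried moving the regular expression around in my code, but none of the placements produce the expected result. I don't know what I'm doing wrong here. I also haven't yet added code to exit the program once the count reaches 1000.


Thanks in advance!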

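Actually, this line of code

$sequence =~m/([ACTG]{21}[G]{2})/g;
only performs a regex match and throws the result away. If you print
$sequence
afterwards, you will still get the original, unmodified string.

Please add a snippet like this instead:

if ($sequence =~ /([ACTG]{21}[G]{2})$/) {
    # $1 holds the matched 23-mer; no /g is needed for a single test
}    # please remember to anchor the match at the end with $

By the way, the nested for loops used to parse this data don't seem well structured, and the parsing will not be as fast as it could be.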
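
As a side note to the answer above: anchoring with $ on the whole $sequence can only ever match the final 23 bases of a record. A minimal sketch of one alternative is to let a lookahead regex collect every overlapping k-mer that ends in GG in a single pass. The helper name gg_kmers and the toy sequence are made up for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every
# overlapping k-mer of length $k in $sequence that ends in "GG".
sub gg_kmers {
    my ($sequence, $k) = @_;
    my $prefix = $k - 2;    # number of bases before the trailing GG
    my @found;

    # The lookahead (?=...) consumes no characters, so each successful
    # match advances by only one position and overlapping k-mers are kept.
    while ($sequence =~ /(?=([ACGT]{$prefix}GG))/g) {
        push @found, $1;
    }
    return @found;
}

# Toy sequence, made up for illustration.
my @kmers = gg_kmers('ACGGTGGA', 4);
print "$_\n" for @kmers;    # prints ACGG, then GTGG
```

For real data, $k would be 23 and the sequence would come from each FASTA record. Whether this is faster than the substr loop depends on the input, so it is worth timing both.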

I'm not sure I fully understand your question, but the following may be what you need.

<FASTA>; # discard 'first' 'empty' record

my %data;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);

        $data{$kmer}++ if $kmer =~ /GG$/;
    }
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
    printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
    last if $i == 1000; 
}
Update: to get the results you mention in the comments (below), replace the output code with:

my $i = 1;

while (my ($kmer, $count) = each %data) {
    next unless $count == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}
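Following up on my own comment, to get the first 1000:


Comments:

Eric, this approach doesn't work; it prints out a blank fasta file.

Can you give us some (reduced if necessary) input, and the output you expect from that input? It's hard to help when we can't see the data you're working with.

@DaveCross I've provided the input file; it's just a generic fasta file containing gene sequences. Thanks.

I'm confused: the first line (for crispr_1) is:
catttctctcccatattagg
but no contiguous sequence in the input file you showed matches it. How did you get that particular sequence?

@HåkonHægland Good catch there. I didn't try to match the sequences directly against the file, so I didn't notice. The specific sequences were given to me as a hint of what the first few sequences should be.

Your "expected result" is not at all clear. On the other hand, what your last attempt returns,
catttctcccatattagg
attttctcccatattaggg
tattgctctttgattttgg
seems to be correct.

What I'm trying to achieve is to print the first 1000 23-mers that end in GG and appear only once. Thanks!

@Sunny I've thought about this some more: my solution won't print the first k-mers ending in GG. It prints, in random order, any k-mers that appear once in the file. If you need the first 1000 k-mers, you'll need a different solution.

I checked the output of your updated code, and head -100 and tail -100 match the expected output. It also doesn't seem to be generated in random order, so I don't think there is a problem with this solution. Thanks a lot for checking!

I used the diff command and confirmed there is no difference between the expected output and the output generated by this code. Thanks!

Just for future reference: if I wanted to print only the 23-mers with unique 12-nucleotide endings, would I change the regex to ($kmer =~ /.{10}GG$/)? Thanks.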
Sample output of the first version of the code (sorted by occurrence count):

crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times

. . . .

ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times

. . . .

Code for getting the first 1000 (k-mers are kept in order of first appearance):

<FASTA>; # discard 'first' 'empty' record

my %seen;
my @kmers;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);

        if ($kmer =~ /GG$/) {
            push @kmers, $kmer unless $seen{$kmer}++;
        }
    }
}

my $i = 1;
for my $kmer (@kmers) {
    next unless $seen{$kmer} == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}
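
The order-preserving idea in the loop above can be tried without a file on disk. The following is a minimal, self-contained sketch that reads FASTA records from an in-memory string; the sub name first_unique_gg_kmers, the toy input, and k = 4 are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit k-mers that end in GG and
# occur exactly once across all records, in order of first appearance.
sub first_unique_gg_kmers {
    my ($fh, $k, $limit) = @_;
    local $/ = ">";                     # one FASTA record per read
    my (%seen, @order);

    while (my $record = <$fh>) {
        chomp $record;                  # strip the trailing ">"
        next unless length $record;     # skip the empty first record
        my ($header, @seq) = split /\n/, $record;
        my $sequence = join '', @seq;

        for my $i (0 .. length($sequence) - $k) {
            my $kmer = substr $sequence, $i, $k;
            next unless $kmer =~ /GG$/;
            # Remember the k-mer only the first time it is seen,
            # but count every occurrence.
            push @order, $kmer unless $seen{$kmer}++;
        }
    }

    my @unique = grep { $seen{$_} == 1 } @order;
    splice @unique, $limit if @unique > $limit;
    return @unique;
}

# Toy input, made up for illustration; real data would come from a file.
my $fasta = ">seq1\nACGGTT\nACGGAAGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my $n = 0;
printf ">crispr_%d\n%s\n", ++$n, $_ for first_unique_gg_kmers($fh, 4, 1000);
close $fh;
```

In this toy input ACGG occurs twice and is dropped, so only AAGG is printed. Counting has to finish before anything is printed, because a k-mer seen once early in the file may still reappear later.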

For unique 12-nucleotide endings, change the matching block inside the loop to:

        if ($kmer =~ /(.{10}GG)$/) {
            my $substr = $1;
            push @kmers, $kmer unless $seen{$substr}++;
        }

and the output loop to:

my $i = 1;
for my $kmer (@kmers) {
    my $substr = substr $kmer, -12;
    next unless $seen{$substr} == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}
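
To see what keying on the ending does, here is a small sketch with the tail length as a parameter. The sub name unique_tail_kmers and the toy sequence are hypothetical; with real data $k would be 23 and $tail would be 12.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return k-mers ending in GG whose last $tail bases
# occur only once among all GG-ending k-mers, in order of first appearance.
sub unique_tail_kmers {
    my ($sequence, $k, $tail) = @_;
    my (%tail_count, @order);

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr $sequence, $i, $k;
        next unless $kmer =~ /GG$/;
        my $suffix = substr($kmer, -$tail);
        # Count every occurrence of the suffix, but remember only the
        # first k-mer that carries it.
        push @order, $kmer unless $tail_count{$suffix}++;
    }
    return grep { $tail_count{ substr($_, -$tail) } == 1 } @order;
}

# Toy sequence, made up for illustration: the tail CGG occurs twice,
# the tail TGG occurs once.
my @hits = unique_tail_kmers('AACGGTACGGTTGG', 5, 3);
print "$_\n" for @hits;    # prints GTTGG
```

Note that two different k-mers sharing the same ending are both excluded, which is stricter than requiring the whole k-mer to be unique.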