How to merge and process multiple lines in a file with Perl to generate a report
I am new to Perl and have only tried some messy code.
cat input1.txt
##gff-version 2
##source-version geneious 5.6.4
Xm_ABL1 Geneious CDS 1 168 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 169 334 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 335 628 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 629 901 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 902 985 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 986 1165 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 1166 1350 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious CDS 1351 1504 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
Xm_ABL1 Geneious BLAST Hit 169 334 . + .
Xm_ABL1 Geneious extracted region 1 168 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="351297 -> 351464"
Xm_ABL1 Geneious extracted region 169 334 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="371785 -> 371950"
Xm_ABL1 Geneious extracted region 335 628 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="372554 -> 372847"
Xm_ABL1 Geneious extracted region 629 901 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="374760 -> 375032"
Xm_ABL1 Geneious extracted region 902 985 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375230 -> 375313"
Xm_ABL1 Geneious extracted region 986 1165 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="375992 -> 376171"
Xm_ABL1 Geneious extracted region 1166 1350 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376575 -> 376759"
Xm_ABL1 Geneious extracted region 1351 1504 . + . Name=Extracted region from gi|371443098|gb|JH556762.1|;Extracted interval="376914 -> 377067"
If the input file contains a forward arrow (->), I would like the output to look like this:
if ($array[7] =~ /.*interval=\"\d+ -> \d+\"$/gm) { $array[5] = "+"; }
cat output1.txt
gi_371443098_gb_JH556762.1 gene 351297 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 351297 351464 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 371785 371950 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 372554 372847 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 374760 375032 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375230 375313 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 375992 376171 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376575 376759 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
gi_371443098_gb_JH556762.1 CDS 376914 377067 . + . Name=Xm_ABL1;created by=User;modified by=User;ID=w0IVHutPuN4H4FVDCg4sFVRaJjQ.1340919460469.4
###
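The test in the question only checks whether a forward arrow is present. A minimal sketch of capturing both coordinates and decoding the arrow in one step might look like the following. The helper name `parse_interval` is hypothetical, and the sketch assumes that a reverse interval is written with `<-` and that the coordinates should be returned in the order they appear in the file.

```perl
use strict;
use warnings;

# Map the arrow inside Extracted interval="A -> B" to a GFF strand symbol.
my %strand_for = ( '->' => '+', '<-' => '-' );

# Hypothetical helper: returns (first, second, strand) as written in the
# attribute column, or an empty list when no interval is found.
sub parse_interval {
    my ($attributes) = @_;
    return unless defined $attributes;
    if ( $attributes =~ /interval="(\d+)\s*(->|<-)\s*(\d+)"/ ) {
        return ( $1, $3, $strand_for{$2} );
    }
    return;
}

# One attribute column copied from the sample input.
my $attr = 'Name=Extracted region from gi|371443098|gb|JH556762.1|;'
         . 'Extracted interval="351297 -> 351464"';
my ( $start, $end, $strand ) = parse_interval($attr);
print join( "\t", $start, $end, $strand ), "\n";    # prints 351297, 351464, + (tab separated)
```

Returning an empty list on failure lets the caller skip lines without an interval instead of working with undefined values.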
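It looks like what you are trying to do is merge the details from the lines marked CDS with the matching lines marked extracted region, and then print the merged result grouped by name, with a leading summary header based on some minimum and maximum values. Is that correct?
I am going to assume that what you call $array[0] (Xm_ABL1 Geneious) and $array[2] (169, 335, etc.) are enough to join the two kinds of line together, but that is not very clear from your sample.
Your first question is just a regexp, and I think you have the general hang of it. I think the problem is in how you capture the data.
To do the second thing you ask for, capture the hi and lo values during the first pass and store them.
I was not going to write a complete solution, but here it is: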
use strict;
use warnings;

my $metadata = {};    # hashref: CDS attributes keyed by "name:start"
my $group    = {};    # hashref: per-gene summary plus its CDS details
my $arrow    = { "->" => '+', "<-" => '-' };    # decode arrow to pos/neg strand

my $file = shift @ARGV or die "Usage: $0 input1.txt\n";
open( my $fh, '<', $file ) or die "Cannot open $file: $!\n";
while (<$fh>) {
    chomp;
    next if /^#/;
    my @array = split /\t/;
    my $key = join( ":", $array[0], $array[2] );
    if ( $array[1] =~ /CDS/ ) {
        $metadata->{$key} = $array[7];
    }
    if ( $array[1] =~ /extracted region/ ) {
        # assert CDS already processed..
        die "No CDS record for $key!\n" unless $metadata->{$key};
        ( my $label = $array[7] ) =~ s/.*region from (.*)\|;.*/$1/;
        $label =~ s/\|/_/g;
        $group->{$label} ||= {    # seed summary if not exists
            pos1      => 1e10,
            pos2      => 0,
            metadata  => $metadata->{$key},
            sequences => [],
        };
        my ( $pos1, $arr, $pos2 ) = $array[7] =~ /.*interval=\"(\d+) (<?->?) (\d+)\"$/;
        # capture hi/lo values for group
        $group->{$label}->{pos1} = $pos1 if $pos1 < $group->{$label}->{pos1};
        $group->{$label}->{pos2} = $pos2 if $pos2 > $group->{$label}->{pos2};
        # push this sequence onto the group's array
        push( @{ $group->{$label}->{sequences} }, [ $pos1, $pos2, $arrow->{$arr} ] );
    }
}
close $fh;

for my $gene ( sort keys %{$group} ) {
    # write out header
    printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
        $gene, 'gene',
        $group->{$gene}->{pos1}, $group->{$gene}->{pos2},
        $group->{$gene}->{sequences}->[0]->[2],
        $group->{$gene}->{metadata};
    foreach my $sequence ( @{ $group->{$gene}->{sequences} } ) {
        # write out details
        printf "%s\t%s\t%d\t%d\t.\t%s\t.\t%s\n",
            $gene, 'CDS',
            $sequence->[0], $sequence->[1], $sequence->[2],
            $group->{$gene}->{metadata};
    }
}
print "###\n";
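The hi/lo capture in the answer seeds each group with sentinel values (1e10 and 0). As a hedged alternative, the same first-pass idea can be sketched without sentinels by starting from undefined bounds. The function name `span_of` is hypothetical, and the coordinates are copied from three "extracted region" rows of the sample input.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of [start, end] pairs, return the
# smallest and largest coordinate seen, or (undef, undef) for no input.
sub span_of {
    my @intervals = @_;
    my ( $lo, $hi );
    for my $interval (@intervals) {
        for my $pos (@$interval) {
            $lo = $pos if !defined $lo || $pos < $lo;
            $hi = $pos if !defined $hi || $pos > $hi;
        }
    }
    return ( $lo, $hi );
}

my ( $lo, $hi ) = span_of(
    [ 351297, 351464 ],
    [ 371785, 371950 ],
    [ 376914, 377067 ],
);
printf "gene\t%d\t%d\n", $lo, $hi;    # gene, 351297, 377067 (tab separated)
```

Because both ends of every pair are compared, the result does not depend on the order of the two numbers inside an interval.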
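I am using perl version 5, so I added this to the code: no warnings 'uninitialized';
It was written in perl5. When I ran it with the sample data I got no warnings, so perhaps the full data has inconsistencies that are not apparent in the sample. I suggest you "address the cause, not the effect". Modify the code to allow uninitialized values only where necessary, by using if (($maybe_null_variable || 0) > 10). Also, if my answer solved your problem, the etiquette is to mark it as the "correct" answer and/or upvote it, so that my effort gets credit.
To reinforce the previous comment: no warnings 'uninitialized'; is a very bad idea. In nearly 20 years of writing Perl I have used that pragma only a few times, and then only scoped locally to a code block or sub, never as a global switch. The program is telling you "something unexpected is happening here, or something you have not considered". So pay attention! If the oil light on your car's dashboard came on, would you check the oil, or just pull the bulb out? It is the same principle.
Change the order of the entries in the sequence arrayref (the last line of the /extracted region/ block) so that the order matches the final output. See the edit above.
Isn't $array[5] already equal to + or -? It is hard to know from your original sample. If one sequence has one, do they all? What do you want in the gene header record? You might like to consider editing the original question rather than adding comments; that would be easier to read and interpret.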
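The two suggestions in the comments above can be sketched as follows. Both sub names (`exceeds_limit`, `describe`) are hypothetical and exist only for illustration. The first defaults a possibly undefined value with `|| 0`, as suggested; on perl 5.10 or later the defined-or operator `//` could be used instead so that only undef, not every false value, is replaced. The second shows the pragma confined to a single sub body rather than used as a global switch.

```perl
use strict;
use warnings;

# Hypothetical helper: treat an undefined value as 0 before comparing,
# so the comparison itself never sees undef.
sub exceeds_limit {
    my ( $value, $limit ) = @_;
    return ( ( $value || 0 ) > $limit ) ? 1 : 0;
}

# Hypothetical helper: the pragma is lexically scoped, so it silences the
# 'uninitialized' category inside this sub body only.
sub describe {
    my ($strand) = @_;
    no warnings 'uninitialized';
    return "strand=$strand";
}

print exceeds_limit( undef, 10 ) ? "over limit\n" : "within limit\n";    # within limit
print describe(undef), "\n";                                              # strand=
```

Everywhere outside `describe`, an undefined value still triggers a warning, so unexpected gaps in the real data remain visible.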
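#!/usr/bin/perl
use strict;
use warnings;

# Declared up front so the script compiles under "use strict".
my ( %cds, %cds_cnt, %extract, %extract_cnt );

my $file = shift @ARGV or die "Usage: $0 input1.txt\n";
open( my $fh, '<', $file ) or die "Cannot open $file: $!\n";
while (<$fh>) {
    chomp;
    next if /^#/;    # skip the ## header lines
    my @array = split /\t/;
    my $key = "$array[2]-$array[0]-$array[1]-$array[2]-$array[3]";
    if ( $array[1] eq "CDS" ) {
        $cds_cnt{$key}++;
        $cds{$key} = "$array[4]\t$array[5]\t$array[6]\t$array[7]";
    }
    if ( $array[1] eq "extracted region" ) {
        my ( $pos1, $pos2 ) = $array[7] =~ /.*interval=\"(\d+) -> (\d+)\"$/;
        next unless defined $pos1;    # only forward (->) intervals are handled here
        $extract_cnt{$key}++;
        $extract{$key} = "$pos1\t$pos2";
    }
}
close $fh;

# Keys begin with the start coordinate, so sort numerically on that prefix.
foreach my $cds_key ( sort { ( split /-/, $a )[0] <=> ( split /-/, $b )[0] } keys %cds ) {
    ( my $region_key = $cds_key ) =~ s/CDS/extracted region/g;
    next unless exists $extract_cnt{$region_key};
    if ( $cds_cnt{$cds_key} == $extract_cnt{$region_key} ) {
        my @array = split /-/, $cds_key;
        my @pos   = split /\t/, $extract{$region_key};
        print "$array[1]\t$array[2]\t$pos[0]\t$pos[1]\t$cds{$cds_key}\n";
    }
}
print "###\n";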