Group hash keys that are contained within a numeric range
I have a dataset containing the combined results of several different methods (attempts 1 to 3) for determining positions in a genome:
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608
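I want to find and group the positions that have been identified by each attempt, leaving a little room for error. For example, I would like to treat the first two lines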
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
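...as a single event (event 1) that has been identified by two different attempts (attempt1 and attempt2). I only want to classify such instances as a single event if they:
- agree on the position of bp1 (i.e. within a window of +/- 5: 5890200..5890210)
- identify the same chromosome1 and chromosome2 (2L)
- agree on the position of bp2 (i.e. within the window 5890715..5890725)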
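The three criteria above can be written down as a small predicate. This is only a sketch: the names within_window and agrees, and the layout of each record as a hash ref keyed by column name, are assumptions made for illustration and do not come from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $pos lies within +/- $half_width of $center.
sub within_window {
    my ($center, $pos, $half_width) = @_;
    return abs($center - $pos) <= $half_width;
}

# Hypothetical predicate combining the three criteria.
# Records are assumed to be hash refs keyed by the column names.
sub agrees {
    my ($x, $y, $half_width) = @_;
    return $x->{chromosome1} eq $y->{chromosome1}
        && $x->{chromosome2} eq $y->{chromosome2}
        && within_window($x->{bp1}, $y->{bp1}, $half_width)
        && within_window($x->{bp2}, $y->{bp2}, $half_width);
}

# The first three rows of the dataset.
my %first  = (source => 'attempt1', chromosome1 => '2L', bp1 => 5890205,
              chromosome2 => '2L', bp2 => 5890720);
my %second = (source => 'attempt2', chromosome1 => '2L', bp1 => 5890205,
              chromosome2 => '2L', bp2 => 5890721);
my %third  = (source => 'attempt1', chromosome1 => '2L', bp1 => 22220720,
              chromosome2 => '2L', bp2 => 22255744);

print agrees(\%first, \%second, 5) ? "same event\n" : "different events\n";
print agrees(\%first, \%third,  5) ? "same event\n" : "different events\n";
```

With a half-width of 5, rows one and two agree (bp1 differs by 0, bp2 by 1), while rows one and three do not.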
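my %SVs;
my $header;

# Make hash: chromosome1 -> bp1 -> chromosome2 -> bp2 -> [lines]
# ($in is assumed to be a file handle already opened on the dataset)
while (<$in>) {
    chomp;
    if ($. == 1) {
        $header = $_;
        next;
    }
    my ($source, $chromosome1, $bp1, $chromosome2, $bp2) = split;
    push @{ $SVs{$chromosome1}{$bp1}{$chromosome2}{$bp2} }, $_;
}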
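The code below defines each equivalence class by the last entry added to it (based on my understanding of your comment above):

What if you have three attempts, and #1 and #2 agree, and #2 and #3 agree (but #1 and #3 do not)? Which event does the second attempt belong to, then?

Good question. To me this means they should be grouped, and then I would want to extend my window (e.g. to +/- 10) to allow for this.

Works great. Can you explain what unless (@{$events[-1]}) is doing?

@fugu I simplified the code a bit to get rid of that check. In principle, you should also check that reading the header and the first line of data succeeded, but I guess the input is somewhat under your control.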
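The situation raised in the comments (#1 agrees with #2, #2 agrees with #3, but #1 does not agree with #3) comes down to which member of a class each new row is compared against. The following is a minimal sketch on bp1 values alone. The function name group_positions and its anchor argument are made up for illustration, and the input is assumed to be sorted.

```perl
use strict;
use warnings;

# Group a sorted list of positions into classes. $anchor is the index of
# the class member that each new position is compared against:
#   -1 compares with the most recent member (classes can chain),
#    0 compares with the first member (class width stays bounded).
sub group_positions {
    my ($anchor, $threshold, @positions) = @_;
    my @classes;
    for my $pos (@positions) {
        if (@classes && abs($classes[-1][$anchor] - $pos) <= $threshold) {
            push @{ $classes[-1] }, $pos;
        }
        else {
            push @classes, [$pos];
        }
    }
    return @classes;
}

# bp1 values of the four 3R rows near 14006280 in the dataset.
my @bp1 = (14006279, 14006281, 14006282, 14006286);

for my $anchor (-1, 0) {
    my @classes = group_positions($anchor, 5, @bp1);
    print "anchor $anchor: ", join(' | ', map { join ',', @$_ } @classes), "\n";
}
```

Comparing with the last member keeps all four positions in one class, because each is within 5 of its predecessor. Comparing with the first member splits off 14006286, which is 7 away from 14006279.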
use Data::Dumper;

my %events;
# keys needs a real hash, so each hash ref is dereferenced with %{ ... }
for my $chr1 ( sort keys %SVs ) {
    for my $bp1 ( sort { $a <=> $b } keys %{ $SVs{$chr1} } ) {
        # +/- 5 window around bp1
        my $w1_start = $bp1 - 5;
        my $w1_end   = $bp1 + 5;
        my $window1  = "$w1_start-$w1_end";
        for my $chr2 ( sort keys %{ $SVs{$chr1}{$bp1} } ) {
            for my $bp2 ( sort { $a <=> $b } keys %{ $SVs{$chr1}{$bp1}{$chr2} } ) {
                # +/- 5 window around bp2
                my $w2_start = $bp2 - 5;
                my $w2_end   = $bp2 + 5;
                my $window2  = "$w2_start-$w2_end";
                # $bp1 always lies inside its own window, so this pushes each
                # record exactly once, under a window centred on itself
                for ( $w1_start .. $w1_end ) {
                    if ( $bp1 == $_ ) {
                        push @{ $events{$chr1}{$window1} }, @{ $SVs{$chr1}{$bp1}{$chr2}{$bp2} };
                    }
                }
                # same for $bp2 and its window
                for ( $w2_start .. $w2_end ) {
                    if ( $bp2 == $_ ) {
                        push @{ $events{$chr2}{$window2} }, @{ $SVs{$chr1}{$bp1}{$chr2}{$bp2} };
                    }
                }
            }
        }
    }
}
print Dumper \%events;
event source chromosome1 bp1 chromosome2 bp2
1 attempt1 2L 5890205 2L 5890720
1 attempt2 2L 5890205 2L 5890721
2 attempt1 2L 22220720 2L 22255744
3 attempt1 3L 15568694 3L 15568866
4 attempt3 3R 14006279 3R 14008254
4 attempt1 3R 14006281 3R 14008253
4 attempt2 3R 14006282 3R 14008254
4 attempt3 3R 14006286 3R 14008254
5 attempt1 3R 32060908 3R 32061196
6 attempt1 3R 32066206 3R 32068392
6 attempt3 3R 32066206 3R 32068392
6 attempt2 3R 32066207 3R 32068393
7 attempt2 X 4574312 X 4576608
7 attempt1 X 4574313 X 4576607
7 attempt3 X 4574313 X 4576608
#!/usr/bin/env perl

use strict;
use warnings;

run(\*DATA);

sub run {
    my $fh = shift;
    my @header = split ' ', scalar <$fh>;

    my @events = ([ get_next_event($fh, \@header) ]);

    while (my $event = get_next_event($fh, \@header)) {
        # change the -1 in the second subscript to 0
        # if you want to always compare to the first
        # event added to the equivalence class
        if (same_event($events[-1][-1], $event, 5)) {
            push @{ $events[-1] }, $event;
            next;
        }
        push @events, [ $event ];
    }

    print join("\t", event => @header), "\n";
    for my $i (1 .. @events) {
        for my $ev (@{ $events[$i - 1] }) {
            print join("\t", $i, @{$ev}{@header}), "\n";
        }
    }
}

sub get_next_event {
    my $fh = shift;
    my $header = shift;

    return unless defined(my $line = <$fh>);
    return unless $line =~ /\S/;

    my %event;
    @event{ @$header } = split ' ', $line;
    return \%event;
}

sub same_event {
    my ($x, $y, $threshold) = @_;

    return if $x->{chromosome1} ne $y->{chromosome1};
    # the question also requires chromosome2 to match
    return if $x->{chromosome2} ne $y->{chromosome2};
    return if abs($x->{bp1} - $y->{bp1}) > $threshold;
    return if abs($x->{bp2} - $y->{bp2}) > $threshold;
    return 1;
}

__DATA__
source chromosome1 bp1 chromosome2 bp2
attempt1 2L 5890205 2L 5890720
attempt2 2L 5890205 2L 5890721
attempt1 2L 22220720 2L 22255744
attempt1 3L 15568694 3L 15568866
attempt3 3R 14006279 3R 14008254
attempt1 3R 14006281 3R 14008253
attempt2 3R 14006282 3R 14008254
attempt3 3R 14006286 3R 14008254
attempt1 3R 32060908 3R 32061196
attempt1 3R 32066206 3R 32068392
attempt3 3R 32066206 3R 32068392
attempt2 3R 32066207 3R 32068393
attempt2 X 4574312 X 4576608
attempt1 X 4574313 X 4576607
attempt3 X 4574313 X 4576608
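The script above only ever compares a row with the class that precedes it, so it relies on the input being ordered by chromosome and position, as the sample data is. It also assumes that the header line can be read. The sketch below shows one way to check the header and sort the rows before grouping. The function name read_sorted_events and the choice of sort keys are assumptions for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical reader: validates the header, then returns the records
# sorted by chromosome1, bp1 and bp2 so that nearby calls are adjacent.
sub read_sorted_events {
    my ($fh) = @_;

    my $first = <$fh>;
    die "input is empty, no header line found\n" unless defined $first;
    my @header = split ' ', $first;
    for my $col (qw(chromosome1 bp1 chromosome2 bp2)) {
        die "header is missing column '$col'\n"
            unless grep { $_ eq $col } @header;
    }

    my @events;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;    # skip blank lines
        my %event;
        @event{@header} = split ' ', $line;
        push @events, \%event;
    }

    my @sorted = sort {
           $a->{chromosome1} cmp $b->{chromosome1}
        || $a->{bp1}         <=> $b->{bp1}
        || $a->{bp2}         <=> $b->{bp2}
    } @events;
    return (\@header, \@sorted);
}

# Usage with an in-memory file handle holding three unsorted rows.
my $input = <<'END';
source chromosome1 bp1 chromosome2 bp2
attempt1 3R 14006281 3R 14008253
attempt1 2L 5890205 2L 5890720
attempt3 3R 14006279 3R 14008254
END
open my $in_fh, '<', \$input or die "cannot open in-memory handle: $!";
my ($header, $events) = read_sorted_events($in_fh);
print join("\t", @{$_}{@$header}), "\n" for @$events;
```

The sorted list of hash refs has the same shape as the records returned by get_next_event, so it could be passed through the same grouping loop.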