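Processing tab-delimited files with Perl hashes
I have two files:
- File_1 has three columns (marker (SNP), chromosome, and position)
- File_2 has three columns (chromosome, peak_start, and peak_end)
All columns are numeric except the SNP column. The files are laid out as shown in the screenshot. File_1 has several hundred SNPs as rows, while File_2 has 61 peaks. Each peak is delimited by a peak_start and a peak_end. Any of the 23 chromosomes may appear in either file, and File_2 has several peaks per chromosome.
I want to know whether, for each matching chromosome, the position of a SNP in File_1 falls within the peak_start to peak_end range of a peak in File_2. If so, I want to show which SNP falls in which peak (preferably writing the output to a tab-delimited file).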
#!/usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my @array;
# Open the file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
    die "could not open X81_05.txt";
}
# Split the tab-delimited file into its fields
while (<FIRST_SAMPLE>) {
    chomp;
    next if (m/Chromosome/); # Skip the header
    @array = split("\t", $_);
    ($chr1, $pos, $sample) = @array;
    $X81_05{'$array[0]'} = (
        'position' => '$array[1]'
    )
}
close (FIRST_SAMPLE);
# Open the file using a file handle
unless (open (PEAKS, "peaks.txt")) {
    die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
    chomp;
    next if (m/Chromosome/); # Skip the header
    ($chr, $peak_start, $peak_end) = split(/\t/);
    $peaks{$chr}{'peak_start'} = $peak_start;
    $peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
    my $val = $X81_05{$chr1}{'position'};
    for my $chr (keys %peaks) {
        my $min = $peaks{$chr}{'peak_start'};
        my $max = $peaks{$chr}{'peak_end'};
        if (($val > $min) and ($val < $max)) {
            #print $val, " ", "is between", " ", $min, " ", "and", " ", $max, "\n";
        }
        else {
            #print $val, " ", "is not between", " ", $min, " ", "and", " ", $max, "\n";
        }
    }
}
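You only need one for loop, because you expect to find some of those SNPs in the second batch. So loop through your %X81_05 hash and check whether any of its keys matches one in %peaks. Something like: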
for my $chr1 (keys %X81_05)
{
    if (defined $peaks{$chr1})
    {
        if (   $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
            && $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
        {
            print YOUROUTPUTFILEHANDLE $chr1 . "\t"
              . $peaks{$chr1}{'peak_start'} . "\t"
              . $peaks{$chr1}{'peak_end'} . "\n";
        }
        else
        {
            print YOUROUTPUTFILEHANDLE $chr1
              . "\tDoes not fall between "
              . $peaks{$chr1}{'peak_start'} . " and "
              . $peaks{$chr1}{'peak_end'} . "\n";
        }
    }
}
Note: I have not tested this code.
Having looked at the screenshot you added, this will not work.
A few Perl programming tips:
You can do this:
open (PEAKS, "peaks.txt")
    or die "Couldn't open peaks.txt";
instead of this:
unless (open (PEAKS, "peaks.txt")) {
    die "could not open peaks.txt";
}
It is more standard Perl and a bit easier to read.
Speaking of standard Perl, you should use the three-argument form of open and a scalar for the file handle:
open (my $peaks_fh, "<", "peaks.txt")
    or die "Couldn't open peaks.txt";
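As a small extension of the tip above, it can help to include Perl's `$!` variable in the die message, since it holds the operating system's reason for the failure. The sketch below is only an illustration: the helper name `read_lines` and the file name `no_such_file.txt` are made up for the example and do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): read all lines of a file,
# reporting the system error text in $! if the open fails.
sub read_lines {
    my ($path) = @_;
    open(my $fh, "<", $path)
        or die "Couldn't open $path: $!\n";
    chomp(my @lines = <$fh>);   # strip the trailing newline from every line
    close $fh;
    return @lines;
}

# eval traps the die so the message can be inspected instead of ending the script.
my @lines = eval { read_lines("no_such_file.txt") };
print "Error was: $@" if $@;
```

The exact wording of the system error depends on the platform and locale, so the message is best treated as diagnostic text rather than something to match on.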
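I am assuming the columns are chromosome, position, and SNP. How are the files laid out? You have to be clear about what you are asking for.
Anyway, here is a tested version that prints in tab-delimited format. It is written in a somewhat more modern Perl style. Notice that I have only a single hash keyed by chromosome (as you specified). I read the peaks.txt file first. If I find a chromosome in my position file that does not exist in my peaks.txt file, I simply ignore it. Otherwise, I add the additional hash entries for POSITION and SNP.
I do a final loop that prints everything out tab-delimited, as you specified, but you did not specify a format. Change it if you have to.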
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;    # No need to check for file open failure
use constant {
    PEAKS_FILE    => "peaks.txt",
    POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
    chomp $line;
    next if $line =~ /Chromosome/;    # Skip header
    my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
    $chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
    $chromosome_hash{$chromosome}->{PEAK_END}   = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
    chomp $line;
    my ( $chromosome, $position, $snp ) = split ( "\t", $line );
    # Also skips the header line, whose first field is not a known chromosome
    next unless exists $chromosome_hash{$chromosome};
    if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
            and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
        $chromosome_hash{$chromosome}->{SNP}      = $snp;
        $chromosome_hash{$chromosome}->{POSITION} = $position;
    }
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome SNP POSITION PEAK-START PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
    next unless exists $chromosome_hash{$chromosome}->{SNP};
    say join ("\t",
        $chromosome,
        $chromosome_hash{$chromosome}->{SNP},
        $chromosome_hash{$chromosome}->{POSITION},
        $chromosome_hash{$chromosome}->{PEAK_START},
        $chromosome_hash{$chromosome}->{PEAK_END},
    );
}
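The question says File_2 has several peaks per chromosome, and a hash that stores one PEAK_START and PEAK_END per chromosome keeps only the last peak read for each key (and likewise only one SNP). A minimal sketch of one possible way around that is to store a list of peaks under each chromosome and test every SNP against that list. Everything specific here is an assumption for illustration: the helper names `load_peaks` and `find_hits`, the made-up sample rows, the inclusive bounds, and the File_1 column order (SNP, chromosome, position, as described in the question). Header lines are skipped by requiring numeric coordinates rather than by matching a header name.

```perl
use strict;
use warnings;

# Hypothetical helper: build { chromosome => [ [start, end], ... ] }
# from tab-delimited lines (chromosome, peak_start, peak_end).
sub load_peaks {
    my @lines = @_;
    my %peaks_by_chr;
    for my $line (@lines) {
        chomp $line;
        my ($chr, $start, $end) = split /\t/, $line;
        # Skip the header and any malformed row: coordinates must be integers.
        next unless defined $end && $start =~ /^\d+$/ && $end =~ /^\d+$/;
        push @{ $peaks_by_chr{$chr} }, [ $start, $end ];
    }
    return \%peaks_by_chr;
}

# Hypothetical helper: return one [snp, chr, pos, start, end] row for every
# SNP/peak pair where the position lies inside the peak (bounds inclusive).
sub find_hits {
    my ($peaks_by_chr, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        chomp $line;
        my ($snp, $chr, $pos) = split /\t/, $line;
        next unless defined $pos && $pos =~ /^\d+$/;    # skips the header too
        for my $peak (@{ $peaks_by_chr->{$chr} // [] }) {
            my ($start, $end) = @$peak;
            push @hits, [ $snp, $chr, $pos, $start, $end ]
                if $pos >= $start && $pos <= $end;
        }
    }
    return @hits;
}

# Made-up sample data standing in for the two files.
my @peak_lines = (
    "Chromosome\tPeak_start\tPeak_end",
    "1\t100\t200", "1\t500\t600", "2\t50\t80",
);
my @snp_lines = (
    "SNP\tChromosome\tPosition",
    "rs1\t1\t150", "rs2\t1\t550", "rs3\t2\t90", "rs4\t3\t10",
);

my $peaks = load_peaks(@peak_lines);
print join("\t", qw(SNP Chromosome Position Peak_start Peak_end)), "\n";
print join("\t", @$_), "\n" for find_hits($peaks, @snp_lines);
```

With this sample data, rs1 and rs2 are reported against the two different peaks on chromosome 1, while rs3 (outside its peak) and rs4 (a chromosome with no peaks) are not reported. To run it on real files, the two arrays could be filled by reading X81_05.txt and peaks.txt line by line.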