Perl/Linux: filtering a large file with the content of another file

linux perl awk

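I am filtering a 580 MB file using the content of another, smaller file. File 1 (the smaller file)

File 2 (the large file)

I want to capture a line from File 2 if it meets the following conditions.

File2.Chr == File1.Chr && File2.Pos > File1.Start && File2.Pos < File1.End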


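I have tried using awk, but it runs very slowly, and I am wondering whether there is a better way to accomplish the same thing.
Thank you.
Here is the code I am using: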
#!/usr/bin/perl -w
use strict;
use warnings;

my $bed_file = "/data/1000G/Hotspots.bed";#File1 smaller file
my $SNP_file = "/data/1000G/SNP_file.txt";#File2 larger file
my $final_file = "/data/1000G/final_file.txt"; #final output file

open my $in_fh, '<', $bed_file
        or die qq{Unable to open "$bed_file" for input: $!};

    while ( <$in_fh> ) {

     my $line_str = $_;

     my @data = split(/\t/, $line_str);

     next if /\b(?:track)\b/;# skip header line
     my $chr = $data[0]; $chr =~ s/chr//g; print "chr is $chr\n";
     my $start = $data[1]-1; print "start is $start\n";
     my $end = $data[2]+1; print "end is $end\n";

     my $cmd1 = "awk '{if(\$1==$chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file"; print "cmd1\n";
     my $cmd2 = `awk '{if(\$1==$chr && \$2>$start && \$2<$end) print (\"chr\"\$1\"_\"\$2\"_\"\$3\"_\"\$4\"_\"\$5\"_\"\$6\"_\"\$7\"_\"\$8)}' $SNP_file >> $final_file`; print "cmd2\n";

}

Read the small file into a data structure and check every line of the other file against it.
Here I read it into an array in which each element is an arrayref holding the fields of one line. Each line of the data file is then checked against the arrayrefs in that array, comparing the fields as the requirements specify.
use warnings 'all';
use strict;

my $ref_file = 'reference.txt';
open my $fh, '<', $ref_file or die "Can't open $ref_file: $!";
my @ref = map { chomp; [ split ] } grep { /\S/ } <$fh>;

my $data_file = 'data.txt';
open $fh, '<', $data_file or die "Can't open $data_file: $!";

# Drop header lines
my $ref_header  = shift @ref;    
my $data_header = <$fh>;

while (<$fh>) 
{
    next if not /\S/;  # skip empty lines
    my @line = split;

    foreach my $refline (@ref) 
    {
        next if $line[0] != $refline->[0];
        if ($line[1] > $refline->[1] and $line[1] < $refline->[2]) {
            print "@line\n";
        }
    }   
}
close $fh;

As already noted, calling awk on every iteration is very slow. A full awk solution is possible, and I have just seen a Perl solution, so here is my Python solution, since the OP does not mind:

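create a dictionary from the small file: chr => list of (start, end) pairs
iterate through the big file and try to match chr and a position that lies between one of the start/end tuples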

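Code:
import collections

with open("smallfile.txt") as f:
    next(f)  # skip the header
    # build a dictionary with chr as key and a list of (start, end) as value
    d = collections.defaultdict(list)
    for line in f:
        toks = line.split()
        if len(toks) == 3:
            d[int(toks[0])].append((int(toks[1]), int(toks[2])))

with open("largefile.txt") as f:
    next(f)  # skip the header
    for line in f:
        toks = line.split()
        chr_tok = int(toks[0])
        if chr_tok in d:
            # the key is in the dictionary
            pos = int(toks[1])
            # keep the line if pos lies strictly inside any (start, end) pair
            if any(t[0] < pos < t[1] for t in d[chr_tok]):
                print(line.rstrip())

Another way, this time storing the smaller file in a hash of arrays (HoA) keyed on the "chr" field: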
use strict;
use warnings;

my $small_file = 'small.txt';
my $large_file = 'large.txt';

open my $small_fh, '<', $small_file or die $!;

my %small;

while (<$small_fh>){
    next if $. == 1;
    my ($chr, $start, $end) = split /\s+/, $_;
    push @{ $small{$chr} }, [$start, $end];
}

close $small_fh;

open my $large_fh, '<', $large_file or die $!;

while (my $line = <$large_fh>){
    my ($chr, $pos) = (split /\s+/, $line)[0, 1];

    if (defined $small{$chr}){
        for (@{ $small{$chr} }){
            if ($pos > $_->[0] && $pos < $_->[1]){
                print $line;
            }
        }
    }
}
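The hash-of-arrays lookup above still scans every range stored for a chromosome. As a hedged variation (a swapped-in technique, not part of the answer), the sketch below sorts and merges the ranges per chromosome and then uses binary search from Python's standard `bisect` module. The function names `build_index` and `in_any_interval` are hypothetical, and the sample values are taken from the example output rows shown elsewhere in this thread.

```python
import bisect
from collections import defaultdict


def build_index(intervals):
    """Group (chr, start, end) triples by chr and merge overlapping ranges.

    Returns {chr: (starts, ends)}, both lists sorted ascending.
    """
    by_chr = defaultdict(list)
    for chrom, start, end in intervals:
        if start < end:  # ignore empty ranges
            by_chr[chrom].append((start, end))

    index = {}
    for chrom, ranges in by_chr.items():
        ranges.sort()
        starts, ends = [], []
        for start, end in ranges:
            # Open ranges overlap only when the new start is strictly below
            # the current end; touching ranges stay separate so the shared
            # boundary position remains excluded.
            if ends and start < ends[-1]:
                ends[-1] = max(ends[-1], end)
            else:
                starts.append(start)
                ends.append(end)
        index[chrom] = (starts, ends)
    return index


def in_any_interval(index, chrom, pos):
    """True if start < pos < end for some stored range on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect.bisect_left(starts, pos) - 1  # last range with start < pos
    return i >= 0 and pos < ends[i]


# Hypothetical usage with values from the thread's sample output.
index = build_index([("1", 123, 150), ("2", 450, 600)])
print(in_any_interval(index, "1", 124))
print(in_any_interval(index, "2", 600))
```

Each lookup is then logarithmic in the number of ranges on that chromosome, which only matters when a chromosome carries many ranges; for a handful, the plain loop is just as good.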

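Put them into a SQLite database and do a join. This will be much faster, less buggy, and use less memory than trying to write something yourself. It is also more flexible: you can now simply run SQL queries on the data instead of writing new scripts and re-parsing the files.
You can import them by parsing and inserting the rows yourself, or you can convert them to CSV and use SQLite's CSV import. With data this simple, converting to CSV can be as easy as s{ +}{,}g, or you can use a full-featured and very fast CSV module.
The tables here are named file1 and file2, with columns named after the fields (you will want better names for the tables and fields).
Create some indexes on the columns you will be searching. Indexes slow down the import, so make sure you create them after the import.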
create index chr_file1 on file1 (chr);
create index chr_file2 on file2 (chr);
create index pos_file2 on file2 (pos);
create index start_file1 on file1 (start);
create index end_file1 on file1 (end);

Then do the join:
select *
from file2
join file1 on file1.chr == file2.chr
where file2.pos between file1.start and file1.end;

1,124,r2,3,s,4,s,2,s,2,1,123,150
2,455,t2,4,2,4,t,3,w,3,2,450,600

You can do this from Perl through a SQLite database driver.
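For readers who would rather try the database approach from Python, here is a minimal sketch using the standard `sqlite3` module. The table and column names (`regions`, `snps`, `start_pos`, `end_pos`) and the function name are hypothetical, and only the chr and pos columns of the large file are loaded to keep it short. Note that it uses strict `>` and `<` comparisons to match the condition in the question, whereas `between` in the query above includes both endpoints.

```python
import sqlite3


def positions_in_hotspots(regions, positions):
    """Return the (chr, pos) rows that fall strictly inside any region."""
    conn = sqlite3.connect(":memory:")  # use a file path for real data
    conn.execute(
        "CREATE TABLE regions (chr TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.execute("CREATE TABLE snps (chr TEXT, pos INTEGER)")
    conn.executemany("INSERT INTO regions VALUES (?, ?, ?)", regions)
    conn.executemany("INSERT INTO snps VALUES (?, ?)", positions)
    # Create the indexes after the bulk insert, as the answer recommends.
    conn.execute("CREATE INDEX regions_chr ON regions (chr, start_pos, end_pos)")
    conn.execute("CREATE INDEX snps_chr_pos ON snps (chr, pos)")
    rows = conn.execute(
        """
        SELECT DISTINCT snps.chr, snps.pos
        FROM snps
        JOIN regions ON regions.chr = snps.chr
        WHERE snps.pos > regions.start_pos
          AND snps.pos < regions.end_pos
        ORDER BY snps.chr, snps.pos
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical usage with values from the sample output above.
print(positions_in_hotspots(
    [("1", 123, 150), ("2", 450, 600)],
    [("1", 124), ("2", 455), ("1", 123), ("3", 500)],
))
```

`DISTINCT` keeps a position from being reported twice when it falls inside two overlapping regions.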
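awk power with a single pass. Your code iterates over file2 as many times as there are lines in file1, so the execution time grows linearly with the size of file1. Please let me know whether this single-pass solution is slower than the other solutions.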
awk 'NR==FNR {
    i = b[$1];        # get the next index for the chr
    a[$1][i][0] = $2; # store start
    a[$1][i][1] = $3; # store end
    b[$1]++;          # increment the next index
    next;
}

{
    p = 0;
    if ($1 in a) {
        for (i in a[$1]) {
            if ($2 > a[$1][i][0] && \
                $2 < a[$1][i][1])
                p = 1                 # set p if $2 in range
        }
    }
}

p {print}' file1 file2

One-liner:
awk 'NR==FNR {i = b[$1];a[$1][i][0] = $2; a[$1][i][1] = $3; b[$1]++;next; }{p = 0;if ($1 in a){for(i in a[$1]){if($2>a[$1][i][0] && $2<a[$1][i][1])p=1}}}p' file1 file2

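You are calling awk twice in the loop. No wonder it is slow. Interested in a Python solution?
Sure, I have always wanted to learn Python. Thanks.
@Jean-François Fabre Actually, only the second line ($cmd2 = …) calls awk. The $cmd1 = … line just sets a string variable. We can tell from the different quotes used: double quote (" = assign) versus backtick (` = execute). But either way, you are right.
I wanted to try all the suggestions but was not able to implement them all. I used your suggestion. Thank you for the advice.
Thanks. Let me know if any follow-up questions come up later.
The link for the CSV import is too old. I found another link: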