Linux: summing a numeric column over a sliding window defined by the values in another column

Tags: linux, perl, unix, awk

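I'm facing a problem: I need to sum a numeric column over a sliding window that is defined by the values in another column.

1. My data is tab-delimited, with two numeric columns: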

1000 12
2000 10
3000 9
5000 3
9000 5
10000 90
30000 20
31000 32
39000 33
40000 28
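2. I want to sum column 2 over a window defined by column 1, where the window runs from the column 1 value to that value + 3000. In other words, I need a third column in which each row holds the sum of all column 2 values from rows whose column 1 lies between the current row's column 1 and column 1 + 3000.

It would look like this: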

1000 12 12+10+9
2000 10 10+9+3
3000 9 9+3
5000 3 3
9000 5 5+90
10000 90 90
30000 20 20+32
31000 32 32
39000 33 33+28
40000 28 28
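I'm new to programming. I tried awk, but failed.

I don't know how to control the window over the first column:

awk'i=1;

I'm not very good with awk, but here is something I wrote in Perl, which should also run if you are on a Unix system. Suppose you save it as a file named window.pl: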

#!/usr/bin/perl -w
use strict;

# Usage: window.pl < [filepath or text stream]
# Example: window.pl < window.txt

my $window = 3000;
my @lines = <STDIN>;
my $i = 0;
my $last_line = $#lines;

# Start reading each line
while ($i<= $last_line)
{
    my $current_line = $lines[$i];
    my ($col1, $col2) = ( $current_line =~ /(\d+)\s+(\d+)/ );
    my $ubound = $col1 + $window;
    my @sums = $col2;
    my $lookahead = $i + 1;

    # Start looking at subsequent lines within the window
    while ($lookahead <= $last_line)
    {
        my $next_line = $lines[$lookahead];
        my ($c1, $c2) = ( $next_line =~ /(\d+)\s+(\d+)/ );
        if ($c1 <= $ubound)
        {
            push @sums, $c2;
            ++$lookahead;
        }
        else
        {
            last;
        }
    }

    my $output;
    if ( $#sums > 0 )
    {
        my $sum = join "+", @sums;
        $output = "$col1 $sum\n";
    }
    else
    {
        $output = "$col1 $col2\n";
    }
    print $output;
    ++$i;
}
This only works if the input file is small enough to be read into memory, but it may help you.

Good luck.

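This is not something any language is particularly good at. In fact, what you are asking for is a fairly challenging programming task, especially for a novice.

That said, here is an awk script for you: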

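BEGIN {
    window = 3000;
    # Create lines and sums as empty arrays up front. Some awk versions
    # otherwise treat length(lines) as a scalar use and abort at lines[n].
    split("", lines);
    split("", sums);
}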

function push(line, sum,   n) {
    n = length(lines);
    lines[n] = line;
    sums[n] = sum;
}

function pop(  n, i) {
    n = length(lines);

    if (n > 1) {
        for(i = 0; i < n - 1; i++) {
            lines[i] = lines[i + 1];
            sums[i] = sums[i + 1];
        }
    }
    if (n > 0) {
        delete lines[n - 1];
        delete sums[n - 1];
    }
}

{
    cur_line = $1;
    value = $2;
    n = length(lines);
    pops = 0;
    for (i = 0; i < n; i++) {
        if (lines[i] + window < cur_line) {
            print "Sum for " lines[i] " = " sums[i];
            pops++;
         }
    }
    for (i = 0; i < pops; i++) {
        pop();
    }
    push(cur_line, 0);
    n = length(lines);
    for (i = 0; i < n; i++) {
        sums[i] = sums[i] + value;
    }
}

END {
    n = length(lines);
    for (i = 0; i < n; i++) {
        if (lines[i] < cur_line + window) {
            print "Sum for " lines[i] " = " sums[i];
         }
    }
}
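
For comparison, here is a minimal sketch of the same queue idea that keeps explicit head and tail counters and a running total, so it needs neither length() on an array nor the element shifting done in pop(). The function name window_sums, the tab-separated three-column output, and the requirement that the input is already sorted by column 1 are assumptions made for this sketch, not part of the answers here.

```shell
#!/bin/sh
# window_sums (hypothetical helper): reads "position value" rows, sorted by
# position, on stdin and prints "position<TAB>value<TAB>window sum".
# Optional first argument: window size (assumed default 3000).
window_sums() {
    awk -v window="${1:-3000}" '
    BEGIN { head = 0; tail = 0; total = 0 }

    # Print the oldest pending row with its finished sum, then drop it.
    function emit_oldest() {
        printf "%s\t%s\t%s\n", pos[head], val[head], total
        total -= val[head]
        delete pos[head]
        delete val[head]
        head++
    }

    NF >= 2 {
        # Pending rows whose window ends before this position are complete.
        while (head < tail && pos[head] + window < $1) emit_oldest()
        # Queue the current row; it counts toward every row still pending.
        pos[tail] = $1
        val[tail] = $2
        tail++
        total += $2
    }

    END {
        # Rows still pending when the input ends get whatever was summed so far.
        while (head < tail) emit_oldest()
    }'
}
```

It could be called as window_sums 3000 < test, where test is the data file. Only the rows still inside some open window are held in memory, so the input does not have to fit in memory as a whole.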

Here is a more compact solution:

use warnings;
use strict;

my (%data, @ids);
while (<DATA>) { # read in the data
    /^(\d+)\s+(\d+)$/ or die "bad input: $_";
    push @ids, $1;
    $data{$1} = [$2]
}
for (0 .. $#ids) { # slide window over data
    my ($i, $id) = ($_ + 1, $ids[$_]);

    push @{$data{$id}}, $data{ $ids[$i++] }[0]
        while $i < @ids and $ids[$i] <= $id + 3000;
}

$" = '+';                                                               #"
print "$_: @{$data{$_}}\n" for @ids;

__DATA__
1000 12
2000 10
3000 9
5000 3
9000 5
10000 90
30000 20
31000 32
39000 33
40000 28

Here is a Perl solution:

#!/usr/bin/perl
use strict;
use warnings;

use constant WIN_SIZE => 3000;

my @pending;

while (<>) {
    my ($pos, $val) = split;

    # Store line info, sum, and when to stop summing
    push @pending, { pos   => $pos,
                     val   => $val,
                     limit => $pos + WIN_SIZE,
                     sum   => 0 };

    show($_)   for grep { $_->{limit} <  $pos } @pending; # Show items beyond window

    @pending =     grep { $_->{limit} >= $pos } @pending; # Keep items still in window

    $_->{sum} += $val for @pending;                       # And continue their sums
}

# and don't forget those items left within the window when the data ran out
show($_) for @pending;

sub show {
    my $pending = shift;
    print join("\t", $pending->{pos}, $pending->{val}, $pending->{sum}), "\n";
}

The compact solution (the script that reads its input from __DATA__) prints:

1000: 12+10+9
2000: 10+9+3
3000: 9+3
5000: 3
9000: 5+90
10000: 90
30000: 20+32
31000: 32
39000: 33+28
40000: 28

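I only see one row in your sample data. Maybe you could reformat it so that the sample data matches your question. Use the code brackets to force the formatting.

Sorry, David. I don't know what code brackets are. I tried, but failed. There are two columns in my data.

Thank you very much for your help with editing the post, David.

You're welcome. I still don't understand how you are defining your window. Maybe you could add an update to clarify. What does "column 1 plus 3000" mean? Is the second table you posted your desired output? It doesn't seem to have any relationship to the definition.

@DavidO, let F[1] denote row 1 of the first column and S[1] row 1 of the second column. He wants the sum of S[A..B], where F[B+1] is the first value greater than F[A]+3000. So the first row is 12+10+9, because 1000+3000=4000 and 5000 is the first value greater than 4000.

Thanks. I really like the awk script. When I tried to test it, I ran into a problem. I saved your script in a file and used the command line: awk '{print $2,$3}' test | awk -f slided_awk.awk. The error returned was: awk: slided_awk.awk:7: (FILENAME=- FNR=1) fatal: attempt to use scalar `lines' as array

I'll check the version of the code. The error means that the variable lines was somehow assigned an ordinary value, a string or a number, instead of being an array. That can easily happen if a typo mixes up the line and lines variables.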
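
The definition given in the comments (sum S[A..B], where F[B+1] is the first value greater than F[A]+3000) can be sketched directly with two indices and a prefix-sum array. This is only an illustration under stated assumptions: the function name window_ranges and the "A..B<TAB>sum" output format are invented here, the input must be sorted by column 1, and, unlike a streaming approach, all rows are held in memory.

```shell
#!/bin/sh
# window_ranges (hypothetical helper): for each row A, find the last row B
# with F[B] <= F[A] + window and print "A..B<TAB>sum of S[A..B]".
# Optional first argument: window size (assumed default 3000).
window_ranges() {
    awk -v window="${1:-3000}" '
    NF >= 2 {
        n++
        F[n] = $1                        # first column, row n
        S[n] = $2                        # second column, row n
        prefix[n] = prefix[n - 1] + $2   # running total of S[1..n]
    }
    END {
        b = 1
        for (a = 1; a <= n; a++) {
            if (b < a) b = a
            # Advance b while the next row still lies inside the window of a.
            while (b < n && F[b + 1] <= F[a] + window) b++
            printf "%d..%d\t%s\n", a, b, prefix[b] - prefix[a - 1]
        }
    }'
}
```

For the sample data, the first line of output would be 1..3 followed by 31, matching the 12+10+9 worked out in the comment. Because b never moves backwards, each row is visited a bounded number of times rather than being rescanned for every window.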