Perl: find the maximum value and the several largest values in a huge ASCII file (numbers in scientific notation)


Background:

(1) The following is an excerpt from a huge ASCII file of about 700 MB:

0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05,
    2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05,
    1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,
    4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,
    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,
    8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
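(2) I want to do two things:

(2.1) Find the maximum among the numbers, which are separated by commas and terminated by a semicolon.

For the excerpt above, it is 5.0003081213.

(2.2) Find the 4 (say) largest values in the excerpt.

For the excerpt above, they are 5.0003081213, 0.000421869, 0.0003385935 and 0.0002973858.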


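My thoughts:

(3) I would like to use perl for this job.

(4) I think I can match the numbers with ([0-9.e-]+).


My problem:

(5) However, I am not familiar with perl or unix, and I don't know how to go on to find the maximum.

(6) I spent half a day searching for similar questions and found that I could use List::Util. I don't know whether it is a suitable choice for my problem, and I don't actually know how to use that module either.

(7) Say the numbers are contained in a file named input.txt. Could the task be done with a one-line script?

Thank you for your understanding; I would greatly appreciate your help.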


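Follow-up question:

Thanks to the kind replies and help from Stack Overflow users, I have solved the problem above. But what if I only want to find the maximum from line 3 to line 6 of the following data: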

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.193129938e-07, 0, 0, 0, 0, 0, 0,
    0, 2.505016514e-05, 4.835713883e-05, 6.128770648e-05, 1.38018881e-05, 2.303402101e-05,
    0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0.000101613512, 5.451410965e-05, 0, 0, 0, 0, 0.001172270099, 7.088900819e-05, 0,
    1.848198352e-06, 0.0006870109246, 0.00276857581, 0.002038545509, 0.001111047938,
    0.0007607533934, 0.0007915864957, 0.001105735631, 0.001456989534, 0.0007245351113,
    0.0004262289031, 0.0003041285247, 0.0001528418892, 2.332078749e-05, 9.695149464e-05,
    1.004024021e-07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
that is, within

0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
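So how should the script
grep -o '[0-9e.-]*' file | sort -rg | head -1
be modified to achieve this?


I know that sed can operate on a range of lines when given an address such as 3,6p (as in sed -n '3,6p'). So I wonder whether the script above can be modified by adding such an option. Thanks again for your help.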

awk can handle numbers, even in scientific notation. You can get the maximum with the following script:

awk '{m=(m>$0)?m:$0}END{print m}' RS="[,\n;]" input.file
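
The same streaming idea (look at one value at a time and remember only the largest so far) can be sketched in Perl. This is only an illustration under assumptions: the helper name stream_max is made up here, and every field is assumed to be a valid number separated by commas, semicolons or whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the largest number read from a filehandle,
# treating commas, semicolons and whitespace as separators.
sub stream_max {
    my ($fh) = @_;
    my $max;
    while ( my $line = <$fh> ) {
        for my $field ( split /[,;\s]+/, $line ) {
            next if $field eq '';    # a leading separator yields an empty field
            $max = $field if !defined $max || $field > $max;
        }
    }
    return $max;
}

# Demo on a short in-memory excerpt of the question's data.
my $sample = "0, 0, 3.043678e-05,\n    0.000421869,\n    5.0003081213, 8.70334e-05, 0;\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print stream_max($fh), "\n";
close $fh;
```

For a real file, the in-memory handle would be replaced by something like open my $fh, '<', 'input.txt'. Only one line and one running maximum are held in memory, so the size of the file should not matter much.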

If you really want a one-line script, you can get the maximum with this one:

$/=undef;print "largest: " .(sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0] . "\n";
This one gives the four largest values:

$/=undef;print join ("," , (sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0..3]) . "\n";
Save one of these lines to a file, say sort.pl, and run
cat /path/to/input.txt | perl /path/to/sort.pl


It does what it should, but it is not the prettiest solution. Note that the /r flag on tr requires Perl 5.14 or later.

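This solution is quite verbose and assumes you already know how to get the data into the program. There is no need to use a regular expression to find the numbers. You can split on commas to get a list and then sort it numerically.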

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'max';

# I'm assuming you already have that data in one line in a variable
my $data = qq{0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05, 2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05, 1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;};

# remove the semicolon
chop $data;

# split to a list on comma and possible whitespace
my @numbers = split /,\s*/, $data;

# this is from List::Util
say 'Max: ' . max(@numbers);

# sort numerical and grab the highest 4
say $_ for ( reverse sort { $a <=> $b } @numbers )[ 0 .. 3 ];

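As I understand your question, you want to pick the numbers out of a huge input file. Splitting at the delimiters is therefore not enough; the numbers need to be extracted with a regular expression.

Here is my attempt: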

use strict;
use warnings;

my(@numbers);
while (my $line = <>) {
    while($line =~ m|([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)|g) {
        push @numbers, $1;
    }
}
@numbers = sort { $b <=> $a } @numbers;

print "largest value:\n  $numbers[0]\n";
print "next four numbers: \n  " . join("\n  ",@numbers[1..4]) . "\n";
It is not a one-liner, but it is probably easier to read.


Use it like this:
perl findNumbers.pl input.txt
where findNumbers.pl is the script above.
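
To see why the stricter pattern in the script can matter, here is a small comparison with the character-class pattern from the question. The input line is hypothetical (the question's data contains only numbers, where both patterns behave the same); it is meant to show what happens if the file also contains other text.

```perl
use strict;
use warnings;

# Stricter pattern, as in the script above: optional sign, digits,
# optional fraction, optional exponent (non-capturing group here).
my $strict = qr/[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?/;
# Looser character-class pattern from the question.
my $loose = qr/[0-9.e-]+/;

# Hypothetical input line that mixes text with numbers.
my $line = "value 3.5e-05, see note -; 1.0E+03";

my @strict_hits = $line =~ /($strict)/g;
my @loose_hits  = $line =~ /($loose)/g;

print "strict: @strict_hits\n";
print "loose:  @loose_hits\n";
```

The loose pattern also picks up stray letters "e" and a lone "-", and it splits 1.0E+03 because it knows neither the upper-case E nor the plus sign. Sorting such fragments numerically would treat them as 0 and emit warnings under use warnings.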

I would use a combination of grep and sort:

grep -o '[0-9e.-]*' file | sort -rg | head -N
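  • grep -o '[0-9e.-]\+' (using the regex given in the question) extracts all the numbers in the file.
  • Then sort -g sorts them as general numeric values, taking exponents into account; with -r the result is reversed so that the largest values come first.
  • Finally, head takes the first N values.
Maximum value: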

$ grep -o '[0-9e.-]*' file | sort -rg | head -1
5.0003081213
Top 4:

$ grep -o '[0-9e.-]*' file | sort -rg | head -4
5.0003081213
0.000421869
0.0003385935
0.0002973858
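
For the follow-up question about lines 3 to 6 only, one option is to filter by line number before extracting the numbers; in shell terms that corresponds to feeding the output of sed -n '3,6p' file into the same grep, sort and head pipeline. Since the question asks for perl, here is a minimal Perl sketch of the same idea. The helper name top_in_range and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the $count largest numbers found on lines
# $first..$last (1-based, inclusive) of the filehandle $fh.
sub top_in_range {
    my ( $fh, $first, $last, $count ) = @_;
    my @numbers;
    while ( my $line = <$fh> ) {
        next if $. < $first;    # $. is the current input line number
        last if $. > $last;     # stop reading once past the range
        push @numbers, $line =~ /[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?/g;
    }
    my @sorted = sort { $b <=> $a } @numbers;
    $#sorted = $count - 1 if @sorted > $count;    # keep at most $count values
    return @sorted;
}

# Demo: five short lines, search lines 3..4 for the two largest values.
my $sample = join "\n", "9, 8,", "7e-1, 6,", "1e-3, 0.5,", "0.25, 2e-1;", "100, 3,", "";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for top_in_range( $fh, 3, 4, 2 );
close $fh;
```

Because the loop leaves as soon as the line number exceeds the range, the rest of a 700 MB file is never read, and only the numbers from the selected lines are sorted.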
Update: as a one-liner:

perl -nle 'foreach (split(",|;")) { $_ += 0; @top_n = sort {$b <=> $a} ($_, @top_n); pop @top_n if @top_n > 4; } END { print foreach @top_n; }' input.txt
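Nam, the other solutions are fine and, I believe, they have already helped you solve the problem. However, they do not take the huge input into account. Even lue's solution means storing the entire array in memory and sorting all those hundreds of megabytes. That said, I fully support lue's idea of not redefining the input record separator and reading the file line by line instead. This is very useful when dealing with large files.

There are only about 5 lines of actual code here; the rest is comments, which should help you understand what goes on behind the scenes and, hopefully, help you learn some perl.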

#!/usr/bin/perl -nl

# 0) The -n from above would make the script read the input line by line
# and the -l parameter would automatically strip off any newline chars
# from input and add a newline to every output line

# 1.1) So, the -n parameter made perl read a line from STDIN and place it
# into $_ variable for you. The following code (excluding the END{} block)
# is executed for every input line.
# 1.2) split() takes this $_ string and breaks it into a series of numbers
# (technically still sub-strings), returning the series as an array
# 1.3) Then foreach loops through this array placing each array's item into
# $_ again. (NB. Yes, we're losing the previous $_'s value which was an input
# string but we don't care about it any longer since we've already processed
# it with split().)
foreach (split(",|;")) {

    # 2) Ensure it's stored internally as a number by adding zero to it.
    # This would save us a bit of conversion when sorting values and also
    # make final output nicer. Still, you'll get what you want if you
    # comment the following line out.
    $_ += 0;

    # 3.1) Compose a new array by adding the current value ($_) to what
    # we already have (@top_n). The new array is "($_, @top_n)". It's OK
    # if @top_n has nothing in it or even undefined so far, perl will
    # define and initialise it with an empty array when it encounters
    # the @top_n variable first time. (Note: we should better use -w
    # perl command line parameter and define @top_n explicitly beforehand
    # but I'm omitting it here for the sake of simplicity.)
    # 3.2) Then sort the new array. The "$b <=> $a" expression will make
    # it sorted in descending order.
    @top_n = sort {$b <=> $a} ($_, @top_n);

    # 3.3) Finally, throw away the last item (pop does this) if our top-N
    # array has grown beyond the length of interest (4 in this example).
    # This helps keep our script's memory consumption reasonably low.
    # Without doing this we'd end up with several hundred megabytes
    # in memory which would require sorting.
    pop @top_n if @top_n > 4;
}

# 4) This block is only executed once, after all the input file is read and
# processed.
END {
    # 4.1) Here our old good foreach reads the @top_n array storing
    # current value in $_ for each iteration.
    # 4.2) Being called without parameters, print() outputs the value
    # of $_ variable. Remember, it also adds a newline to the output
    # - we told it doing so by adding -l in the very first line of the
    # script.
    print foreach @top_n;
}
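
The script above re-sorts the small top-N array for every value read. One possible refinement, sketched here as an assumption rather than as part of the original answer, is to skip the sort whenever the array is already full and the new value cannot beat its smallest entry. The helper name offer is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: offer $value to the descending-sorted array
# referenced by $top, keeping at most $limit entries.
sub offer {
    my ( $top, $value, $limit ) = @_;
    # Fast path: the list is full and the value is not larger than
    # the smallest value currently kept, so nothing changes.
    return if @$top >= $limit && $value <= $top->[-1];
    @$top = sort { $b <=> $a } ( @$top, $value );
    pop @$top if @$top > $limit;
    return;
}

# Demo on a few values taken from the question's excerpt.
my @top;
offer( \@top, $_ + 0, 4 )
    for split /[,;]\s*/, "0, 3e-05, 5.0003081213, 0.000421869, 0.0003385935, 0.0002973858, 0;";
print "$_\n" for @top;
```

With data like the question's, where most values are 0, the fast path should handle the vast majority of fields once four non-zero values have been seen.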
#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw ( max );

$/ = ';';

while (<>) {
    s/;//g;
    my @lines = split("\n");
    s/\s+//g;
    my $block_max = max( split(",") );
    last unless defined $block_max;
    print $block_max, "\n";

    my @top;
    foreach my $line (@lines) {
        $line =~ s/\s+//g;
        my @numbers = split( ",", $line );
        my $max_num = max(@numbers);
        if ( defined $max_num ) { push( @top, $max_num ) }
    }

    print "Top 5:\n";
    print join( "\n", ( sort { $b <=> $a } (@top) )[ 0 .. 4 ] );
}
my @numbers = m/[\d+.-]+/g;