Perl: finding the maximum value and a set of largest values (in scientific notation) in a huge ASCII file

perl unix awk

Background:

(1) Below is an extract from a huge ASCII file of about 700 MB:

0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05,
    2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05,
    1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,
    4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,
    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,
    8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;

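(2) I want to do two things:

(2.1) Find the maximum value among the numbers, which are separated by commas and terminated by a semicolon.

In the extract above, that is

5.0003081213

(2.2) Find the 4 (say) largest values in the extract.

In the extract above, those are

5.0003081213, 0.000421869, 0.0003385935 and 0.0002973858

My thoughts:

(3) I would like to do this with perl.

(4) I think I can use ([0-9.e-]+) to match the numbers.

My questions:

(5) However, I am new to perl and unix, and I do not know how to go on from there to find the maximum.

(6) I searched for similar questions for half a day and found that I might be able to use List::Util. I do not know whether it is a suitable choice for my problem, and in fact I do not know how to use this module.

(7) Say the numbers are contained in a file named input.txt. Can the task be done with a one-line script?

Thank you for your understanding; I really appreciate your help.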

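Further question:

Thanks to the kind replies and help from Stack Overflow users, I solved the problem above. But what if I only want to find the maximum from line 3 to line 6 of the following data: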

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.193129938e-07, 0, 0, 0, 0, 0, 0,
    0, 2.505016514e-05, 4.835713883e-05, 6.128770648e-05, 1.38018881e-05, 2.303402101e-05,
    0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0.000101613512, 5.451410965e-05, 0, 0, 0, 0, 0.001172270099, 7.088900819e-05, 0,
    1.848198352e-06, 0.0006870109246, 0.00276857581, 0.002038545509, 0.001111047938,
    0.0007607533934, 0.0007915864957, 0.001105735631, 0.001456989534, 0.0007245351113,
    0.0004262289031, 0.0003041285247, 0.0001528418892, 2.332078749e-05, 9.695149464e-05,
    1.004024021e-07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

that is, within

0, 0, 0, 0, 3.5838803e-05, 0.000104883779, 0, 0, 1.813278467e-05, 0.0001350646297,
    0.0007846746908, 0.001728603877, 0.001082733652, 0.001511217708, 0.0009537032505,
    0.0004436753321, 0.002182536356, 0.0005719495782, 9.055173127e-05, 1.245663419e-05,
    0.0004568318755, 0.0003056741688, 3.186642459e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

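How should I modify the script

grep -o '[0-9e.-]*' file | sort -rg | head -1

to achieve this?

I know that sed can work on a range of lines when given an address such as 3,6p. So I wonder whether the script above can be modified by adding something like that. Thanks again for your help.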

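awk can handle numbers, even in scientific notation. You can get the maximum with the following script (note that a multi-character RS is treated as a regular expression by gawk and mawk, but not necessarily by every awk):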

awk '{m=(m>$0)?m:$0}END{print m}' RS="[,\n;]" input.file
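If the awk at hand does not support a regular expression as RS, a possible workaround is to let tr put one value per line first. The following is only a sketch under that assumption: the sample file is a shortened, hypothetical stand-in for input.file, and the names sample, best and seen are made up for the illustration.

```shell
#!/bin/sh
# Shortened sample in the same layout as the question (hypothetical data file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
0, 0, 3.043678e-05, 3.661498e-05,
    4.467542e-05, 0, 0.000421869,
    5.0003081213, 0.0001938675, 0;
EOF

# tr turns every comma and semicolon into a newline, so awk sees one value
# per line. NF skips blank lines; adding 0 forces a numeric comparison,
# while the original text of the value is kept for printing.
max=$(tr ',;' '\n\n' < "$sample" |
  awk 'NF && (!seen || $1 + 0 > best + 0) { best = $1; seen = 1 }
       END { if (seen) print best }')

echo "$max"
rm -f "$sample"
```

Keeping the original text of the value, rather than the converted number, avoids awk reformatting it on output.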

If you really want a one-line script, you can use this to get the maximum:

$/=undef;print "largest: " .(sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0] . "\n";

This gives you the four largest values:

$/=undef;print join ("," , (sort {$b <=> $a} split /,/ , scalar <> =~ tr/\n ;//rd)[0..3]) . "\n";

Save one of these lines to a file, say sort.pl, and run

cat /path/to/input.txt | perl /path/to/sort.pl

It does what it should, but it is not the prettiest solution.

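This solution is quite verbose and assumes you already know how to get the data into your program. There is no need for a regular expression to find the numbers: you can split on the commas to get a list, and then sort it numerically.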

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'max';

# I'm assuming you already have that data in one line in a variable
my $data = qq{0, 0, 0, 0, 0, 0, 0, 0, 3.043678e-05, 3.661498e-05, 2.070347e-05, 2.47175e-05, 1.49877e-05, 3.031176e-05, 2.12128e-05, 2.817522e-05, 1.802658e-05, 7.192285e-06, 8.467806e-06, 2.047874e-05, 9.621194e-05,4.467542e-05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000421869,    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,8.763598e-05, 2.743326e-05, 0, 0.0001043894, 3.409237e-05, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;};

# remove the semicolon
chop $data;

# split to a list on comma and possible whitespace
my @numbers = split /,\s*/, $data;

# this is from List::Util
say 'Max: ' . max(@numbers);

# sort numerical and grab the highest 4
say $_ for ( reverse sort { $a <=> $b } @numbers )[ 0 .. 3 ];

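As I understand your question, you want to pick the numbers out of a huge input file. Splitting at the separators is therefore not enough; the numbers need to be extracted with a regular expression.

Here is my attempt: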

use strict;
use warnings;

my(@numbers);
while (my $line = <>) {
    while($line =~ m|([-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)|g) {
        push @numbers, $1;
    }
}
@numbers = sort { $b <=> $a } @numbers;

print "largest value:\n  $numbers[0]\n";
print "next four numbers: \n  " . join("\n  ",@numbers[1..4]) . "\n";

This is not a one-liner, but it is probably easier to read.

Use it like this:

perl findNumbers.pl input.txt

where findNumbers.pl is the script above.

I would use a combination of grep and sort:

grep -o '[0-9e.-]*' file | sort -rg | head -N

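The command
```
grep -o '[0-9e.-]\+'
```
(using the regular expression given in the question) extracts all the numbers in the file.
Then
```
sort -g
```
sorts them as general numeric values, so exponents are taken into account; with
```
-r
```
we reverse the result so that the largest values come first.
Finally,
```
head
```
takes the first N values.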

Largest value:

$ grep -o '[0-9e.-]*' file | sort -rg | head -1
5.0003081213

Top 4:

$ grep -o '[0-9e.-]*' file | sort -rg | head -4
5.0003081213
0.000421869
0.0003385935
0.0002973858
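For the follow-up question (only lines 3 to 6), one possible approach is to let sed select the lines first and feed them into the same pipeline. This is a sketch, not a tested recipe for the real file: it assumes a grep with -o and a sort with -g (both are extensions found in GNU and BSD tools, not POSIX), and the sample file is a shortened, hypothetical stand-in that reuses a few values from the question.

```shell
#!/bin/sh
# Shortened stand-in for the data in the follow-up question (hypothetical file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
0, 0, 1.193129938e-07, 0,
2.505016514e-05, 4.835713883e-05, 6.128770648e-05,
0, 3.5838803e-05, 0.000104883779,
0.0007846746908, 0.001728603877, 0.001082733652,
0.0004436753321, 0.002182536356, 0.0005719495782,
0.0004568318755, 0.0003056741688, 3.186642459e-05,
0, 0, 0,
0.0006870109246, 0.00276857581, 0.002038545509,
EOF

# sed -n '3,6p' prints only lines 3 to 6; the rest is the pipeline from above.
# LC_ALL=C keeps sort -g parsing "." as the decimal point in any locale.
range_max=$(sed -n '3,6p' "$sample" |
  grep -oE '[0-9e.-]+' |
  LC_ALL=C sort -rg |
  head -n 1)

echo "$range_max"
rm -f "$sample"
```

The larger value on line 8 of the sample is ignored because sed never passes that line on to grep.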

Update: as a one-liner:

perl -nle 'foreach (split(",|;")) { $_ += 0; @top_n = sort {$b <=> $a} ($_, @top_n); pop @top_n if @top_n > 4; } END { print foreach @top_n; }' input.txt

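Nam, the other solutions are fine and I believe they have already helped you solve the problem. However, they do not take the huge input into account. Even lue's solution means storing the whole array in memory and sorting all those hundreds of megabytes. That said, I fully support lue's idea of not redefining the input record separator and reading line by line instead. This is very useful when dealing with large files.

There are only about 5 lines of actual code here; the rest is comments that should help you understand what happens behind the scenes and, hopefully, learn some perl.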

#!/usr/bin/perl -nl

# 0) The -n from above would make the script read the input line by line
# and the -l parameter would automatically strip off any newline chars
# from input and add a newline to every output line

# 1.1) So, the -n parameter made perl read a line from STDIN and place it
# into $_ variable for you. The following code (excluding the END{} block)
# is executed for every input line.
# 1.2) split() takes this $_ string and breaks it into a series of numbers
# (technically still sub-strings), returning the series as an array
# 1.3) Then foreach loops through this array placing each array's item into
# $_ again. (NB. Yes, we're losing the previous $_'s value which was an input
# string but we don't care about it any longer since we've already processed
# it with split().)
foreach (split(",|;")) {

    # 2) Ensure it's stored internally as a number by adding zero to it.
    # This would save us a bit of conversion when sorting values and also
    # make final output nicer. Still, you'll get what you want if you
    # comment the following line out.
    $_ += 0;

    # 3.1) Compose a new array by adding the current value ($_) to what
    # we already have (@top_n). The new array is "($_, @top_n)". It's OK
    # if @top_n has nothing in it or even undefined so far, perl will
    # define and initialise it with an empty array when it encounters
    # the @top_n variable first time. (Note: we should better use -w
    # perl command line parameter and define @top_n explicitly beforehand
    # but I'm omitting it here for the sake of simplicity.)
    # 3.2) Then sort the new array. The "$b <=> $a" expression will make
    # it sorted in descending order.
    @top_n = sort {$b <=> $a} ($_, @top_n);

    # 3.3) Finally, throw away the last item (pop does this) if our top-N
    # array has grown beyond the length of interest (4 in this example).
    # This helps keep our script's memory consumption reasonably low.
    # Without doing this we'd end up with several hundreds of megabytes
    # in memory which would require sorting.
    pop @top_n if @top_n > 4;
}

# 4) This block is only executed once, after all the input file is read and
# processed.
END {
    # 4.1) Here our old good foreach reads the @top_n array storing
    # current value in $_ for each iteration.
    # 4.2) Being called without parameters, print() outputs the value
    # of $_ variable. Remember, it also adds a newline to the output
    # - we told it doing so by adding -l in the very first line of the
    # script.
    print foreach @top_n;
}
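The same bounded-memory idea can also be sketched with tr and awk, for readers who prefer to stay in the shell. This is an illustration only, not part of the Perl script above: the sample file is a shortened, hypothetical stand-in for input.txt, and the names val, txt, cnt and n are made up here. At most n values are held in memory at any time.

```shell
#!/bin/sh
# Shortened sample in the same layout as the question (hypothetical data file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
0, 0, 3.043678e-05, 3.661498e-05,
    4.467542e-05, 0, 0.000421869,
    5.0003081213, 0.0001938675, 8.70334e-05, 0.0002973858, 0.0003385935,
    8.763598e-05, 0, 0.0001043894, 0;
EOF

# One value per line, then insert each value into a descending list of
# at most n entries. val[] holds the numbers, txt[] the original text.
top4=$(tr ',;' '\n\n' < "$sample" | awk -v n=4 '
  NF {
    v = $1 + 0
    # find the first slot whose value is smaller than v
    for (i = 1; i <= cnt; i++) if (v > val[i]) break
    if (i <= n) {
      if (cnt < n) cnt++
      # shift smaller entries down by one, dropping the last if the list is full
      for (j = cnt; j > i; j--) { val[j] = val[j-1]; txt[j] = txt[j-1] }
      val[i] = v; txt[i] = $1
    }
  }
  END { for (i = 1; i <= cnt; i++) print txt[i] }')

echo "$top4"
rm -f "$sample"
```

Inserting into an already sorted list of n entries avoids re-sorting for every value, which matters little for n = 4 but keeps the cost per value proportional to n.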

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw ( max );

$/ = ';';

while (<>) {
    s/;//g;
    my @lines = split("\n");
    s/\s+//g;
    my $block_max = max( split(",") );
    last unless defined $block_max;
    print $block_max, "\n";

    my @top;
    foreach my $line (@lines) {
        $line =~ s/\s+//g;
        my @numbers = split( ",", $line );
        my $max_num = max(@numbers);
        if ( defined $max_num ) { push( @top, $max_num ) }
    }

    print "Top 5:\n";
    print join( "\n", ( sort { $b <=> $a } (@top) )[ 0 .. 4 ] );
}

my @numbers = m/[\d+.-]+/g;