Perl: jumping to a line in a large text file

I have a very large text file (~4 GB). Its structure is as follows:

S=1
3 lines of metadata of block where S=1
a number of lines of data of this block
S=2
3 lines of metadata of block where S=2
a number of lines of data of this block
S=4
3 lines of metadata of block where S=4
a number of lines of data of this block
etc.

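I am writing a Perl program that reads another file. For each line of that file (which must contain a number), it searches the large file for the S value equal to that number minus 1, and then analyzes the data lines of the block belonging to that S value.

The problem is that the text file is huge, so processing it line by line with a

foreach $line {...}

loop is very slow. Since the S= values are strictly increasing, is there any way to jump to the specific line of the required S value?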

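If the text blocks are all the same length (in bytes or characters), you can compute the position of the required S value in the file, seek there, and then read. Otherwise, in principle, you have to read lines to find the S value.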
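The fixed-length case can be sketched as follows. This is only an illustration under strong assumptions that the question does not guarantee: every block occupies exactly the same number of bytes, and blocks are numbered S=1, 2, 3, ... without gaps (the sample in the question skips S=3, so real data may not qualify). The helper name `read_fixed_block` and its parameters are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Hypothetical helper. Assumes each block is exactly $block_len bytes long
# and that blocks are numbered consecutively starting at S=1.
sub read_fixed_block {
    my ($fh, $s_target, $block_len) = @_;

    # Block S=n starts at byte offset (n - 1) * $block_len.
    seek($fh, ($s_target - 1) * $block_len, SEEK_SET)
        or die "seek failed: $!";

    my $got = read $fh, my $block, $block_len;
    die "read failed: $!" if not defined $got;

    # A short read means the requested block lies past the end of the file.
    return $got == $block_len ? $block : undef;
}
```

With a filehandle opened on the big file, `read_fixed_block($fh, 3, 15)` would return the 15 bytes of the third block without touching the blocks before it.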

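However, if only a few S values need to be found, you can estimate the required position and seek there, then read enough to capture an S value. Then analyze what you read to see how far off you are, and either seek again or read lines with <> to get to the S value.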

use warnings;
use strict;
use feature 'say';

use Fcntl qw(:seek);

my ($file, $s_target) = @ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
$s_target //= 5;  #/ default, S=5

open my $fh, '<', $file or die $!; 

my $est_text_len = 1024;
my $jump_by      = $est_text_len * $s_target;  # to seek forward in file

my ($buff, $found);

seek $fh, $jump_by, SEEK_CUR;  # get in the vicinity

while (1) {

    my $rd = read $fh, $buff, $est_text_len;
    die "error reading: $!" if not defined $rd;
    last if $rd == 0;

    while ($buff =~ /S=([0-9]+)/g) {
        my $s_val = $1;

        # Analyze $s_val and $buff:
        # (1) if overshot $s_target adjust $jump_by and seek back
        # (2) if in front of $s_target read with <> to get to it
        # (3) if $s_target is in $buff extract needed text

        if ($s_val == $s_target) {
            say "--> Found S=$s_val at pos ", pos $buff, " in buffer";
            # rewind by the bytes actually read ($rd), then skip past the match and its newline
            seek $fh, - $rd + pos($buff) + 1, SEEK_CUR;
            while (<$fh>) {
                last if /S=[0-9]+/;  # next block
                print $_;
            }
            $found = 1;
            last;
        }
    }   
    last if $found;
}

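"Is there any way to jump to the specific line of the required S value?"

Yes, if the file does not change, create an index. This requires reading the file in full once and noting the positions of all the S=# lines using tell. The keys are the numbers and the values are the byte positions in the file. Then you can use seek to jump to that position and read from there.

But if you are going to do that, it is better to export the data into a proper database. Write a program that inserts the data into the database and add a normal SQL index. This is probably simpler than writing your own index. You can then query the data efficiently with ordinary SQL, including complex queries. If the file changes, you can redo the export, or update the database with regular insert and update SQL. For anyone who knows SQL, this is easier to work with than a pile of custom indexing and search code.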
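The index approach described above might look like the sketch below. It keeps the index in an in-memory hash rather than a database, and the sub names `build_index` and `read_block` are hypothetical. It assumes each block starts with a line beginning `S=<number>`, as in the question.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);

# Read the file once and record the byte offset at which each "S=<n>" line starts.
sub build_index {
    my ($fh) = @_;
    my %offset_of;
    seek($fh, 0, SEEK_SET) or die "seek failed: $!";
    while (1) {
        my $pos  = tell $fh;              # offset of the line about to be read
        my $line = <$fh>;
        last if not defined $line;
        $offset_of{$1} = $pos if $line =~ /^S=([0-9]+)/;
    }
    return \%offset_of;
}

# Jump straight to block S=$s and return its metadata and data lines,
# excluding the "S=" header line. Returns an empty list for unknown S values.
sub read_block {
    my ($fh, $index, $s) = @_;
    return if not exists $index->{$s};
    seek($fh, $index->{$s}, SEEK_SET) or die "seek failed: $!";
    my $header = <$fh>;                   # consume the "S=<n>" line
    my @lines;
    while (my $line = <$fh>) {
        last if $line =~ /^S=[0-9]+/;     # start of the next block
        push @lines, $line;
    }
    return @lines;
}
```

The full scan happens only in `build_index`; every later lookup costs one seek plus the lines of the requested block.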

Sort the numbers in the second file. Now you can process the huge file sequentially in a single pass, handling each S value as you reach it.
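A rough sketch of this single-pass idea follows. The sub name `process_sorted` and the callback interface are hypothetical; the sketch relies on the S values in the big file being strictly increasing, as the question states, and stops reading once every requested value has been passed.

```perl
use strict;
use warnings;

# Visit the big file once, calling $callback->($s, \@lines) for each wanted block.
# @$wanted holds the S values to look up, in any order.
sub process_sorted {
    my ($fh, $wanted, $callback) = @_;
    my @targets = sort { $a <=> $b } @$wanted;
    my ($current, @lines);

    # Hand the block collected so far (if any) to the callback and reset.
    my $flush = sub {
        $callback->($current, [@lines]) if defined $current;
        ($current, @lines) = ();
    };

    while (my $line = <$fh>) {
        if ($line =~ /^S=([0-9]+)/) {
            $flush->();
            my $s = $1;
            # Drop targets that the file has already moved past.
            shift @targets while @targets and $targets[0] < $s;
            last if not @targets;             # nothing left to find
            $current = $s if $targets[0] == $s;
        }
        elsif (defined $current) {
            push @lines, $line;               # line inside a wanted block
        }
    }
    $flush->();                               # wanted block at end of file
    return;
}
```

Because the question looks up "number minus 1", the caller would subtract 1 from each number in the second file before passing the list in.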
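
I know the OP has already accepted an answer, but a method that has served me well is to slurp the file into an array after changing the "record separator" ($/).

If you do something like this (untested, but should be close):

$/ = "S=";
my @records = <$fh>;    # $fh: a filehandle opened on the big file (name assumed)
print $records[4];

The output should be the whole fifth record (arrays start at 0, but the data starts at 1), beginning with the record number (5) on a line of its own (which you may want to strip later), followed by all the remaining lines of that record.
It is a memory hog, but it is very simple and fast.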

Binary search of a sorted list is an O(log n) operation. Using seek, something like the following:
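open my $fh, '<', $big_file or die "Cannot open $big_file: $!";
my $target = 123_456_789;

my $size = -s $big_file;
my $low  = 0;
my $high = $size;

# narrow [$low, $high] until it spans at most 1% of the file
while ($high - $low > 0.01 * $size) {
    my $mid = ($low + $high) / 2;
    seek $fh, int($mid), 0;
    my $s_val;
    while (<$fh>) {
        if (/^S=(\d+)/) {   # first S= line after $mid
            $s_val = $1;
            last;
        }
    }
    # if no S= line follows $mid, the target must lie before it
    if (defined $s_val and $s_val < $target) { $low = $mid }
    else                                     { $high = $mid }
}

seek $fh, int($low), 0;
while (<$fh>) {
    # now you are searching through the 1% of the file that contains
    # your target S
}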

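
Try ... instead of .... If that is not feasible, read a certain number of MB into a buffer and count the newlines to find the line you need.

Is the number of bytes in each record the same?

Do you need to find a large number of S values in the file (to analyze their text), or not that many?

You could create an index of S-value/filepos for that file, look up the value in the index (binary search), and then seek() to that file position. Without an index, you could do a kind of binary search directly in the file: for example, seek() to the middle of the file, scan from that position for the first S, and keep repeating until you reach the S you want. This needs several file reads (log n), whereas with the index the big file is read only once. Both solutions use almost no memory for the big file (only the size of the index in the first solution, and none at all in the second).

Will the file change? If not, convert it to a better format.

+1. A second example/answer that creates and uses an index (S-value/filepos) would also be good. (I don't know Perl, so I can't provide code.) As long as the file doesn't change, the index can be created once and stored on disk. Since S is already sorted, creating the index is easy (just keep appending S-value/filepos). Then simply binary search the much smaller index in memory.

@Danny_ds Indeed. I assumed a new file each time (with a small number of queries). Added a comment.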
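The suggestion in the comments (build the S-value/filepos index once, store it on disk, then binary search it in memory) could be sketched roughly as below. The sub names and the choice of the core Storable module for the on-disk format are assumptions made for illustration, not something the thread prescribes. The cached index is only valid while the big file stays unchanged.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Scan the big file once; returns [[S value, byte offset], ...] in file order,
# which is also ascending S order because S is strictly increasing.
sub scan_index {
    my ($fh) = @_;
    my @index;
    while (1) {
        my $pos  = tell $fh;              # offset of the line about to be read
        my $line = <$fh>;
        last if not defined $line;
        push @index, [$1 + 0, $pos] if $line =~ /^S=([0-9]+)/;
    }
    return \@index;
}

# Binary search of the in-memory index; returns the byte offset or undef.
sub find_offset {
    my ($index, $s) = @_;
    my ($lo, $hi) = (0, $#$index);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($s_mid, $pos) = @{ $index->[$mid] };
        return $pos if $s_mid == $s;
        if ($s_mid < $s) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return;
}

# Hypothetical caching helper: reuse $idx_file if present, otherwise build it.
# It does not detect a changed big file; delete $idx_file to force a rebuild.
sub load_or_build_index {
    my ($big_file, $idx_file) = @_;
    return retrieve($idx_file) if -e $idx_file;
    open my $fh, '<', $big_file or die "Cannot open $big_file: $!";
    my $index = scan_index($fh);
    close $fh;
    nstore($index, $idx_file);
    return $index;
}
```

The offset returned by `find_offset` can be passed to `seek` on a filehandle opened on the big file, after which the block is read line by line until the next `S=` line.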