如何使用Perl在一个大文件中以最佳方式移动顶部具有特定模式的行？_Perl_File_Csv_Seek_Fileparsing

如何使用Perl在一个大文件中以最佳方式移动顶部具有特定模式的行？

perl file csv

如何使用Perl在一个大文件中以最佳方式移动顶部具有特定模式的行？,perl,file,csv,seek,fileparsing,Perl,File,Csv,Seek,Fileparsing,我有一个近20k行的巨大csv文件，格式如下： file,tools,edit,syntax,buffers a,b,c,perl,d a,w,c33,java,d a,e,c,perl,d a,s,c,python,d1 a,n,c,php,d3 d,r,hhh,cpp,d0 d,m,hhh,c#,d0 a,o,c,pdf,d3 a,f,c,python,dd a,h,c,perl,dg a,yb,c,c,ddf a,b,c,perl,dt wa,b,c33,java,d d,buuu,hhh

我有一个近20k行的巨大csv文件，格式如下：

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

我需要在顶部放置两行具有相同语法的模式（即第4列）。然后，其余的线路将按原样存在。这意味着前两行的语法是“perl”，然后是“java”、“python”等等

到目前为止，我已经使用seek和tell编写了下面的代码，以使其得到优化。然而，它并没有像预期的那样发挥作用

use strict;
use warnings;

open(FP, "+<mycsv.csv");

my %hash = ();
my $cur_pos;    


while(<FP>) {

    my $line = $_;
    chomp $line;
    #print "$line aaa\n";
    if($line =~ /^file\,tools,/) {next;}

    if($line =~ /^\w+\,\w+\,\w+,(\w+)\,.*$/) {
        my $type = $1;
        #print "type $type\n";

    if($hash{$type}->{count} < 2 ) {
        #print "--- here type = $type | lastpos = ", $hash{$type}->{lastpos} , "\n";
        $cur_pos = tell(FP);
        my $pos = tell(FP) - length($line); 
        if($hash{$type}->{lastpos} ) {

            my $lastpos = $hash{$type}->{lastpos};
            seek(FP, $lastpos, 1);
            print FP $line;
            seek(FP, $cur_pos, 1);
        } 

        $hash{$type}->{lastpos} = $pos;


    }
        if(exists $hash{$type} ) {
            $hash{$type}->{count} += 1;
        } else {
            $hash{$type}->{count} = 1;
        }


    }
}


close(FP);

任何帮助，使其工作将不胜感激

谢谢。

对于相同的逻辑，我得到的输出与您的略有不同。你能检查一下这个输出并告诉我是否需要更改吗？在评论中提到了这种方法

use strict;
use warnings;
use feature 'say';
my $syntax = [];
my $NUM = 2;   # change number if needed
my $filename = 'file.txt';
my $data = {};  # make a hash of data

open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    chomp $row;
    next if $. == 1; # skip header row
    my @columns = split (',', $row);

    push @$syntax, $columns[3];   # make a list of all syntaxes available
    push @{$data->{$columns[3]}}, $row;
}
close $fh;

my $processed = {};
# loop throught the syntax array and print data from hash
# also, make a counter of the number of times that syntax is used.
# it will help us to skip next (n-1) occurence of that syntax
for my $syntax (@$syntax) {
    if (!$processed->{$syntax}){
        for my $s (splice @{$data->{$syntax}}, 0, $NUM) {
            $processed->{$syntax} += 1;
            say $s;
        }
    } else {
        $processed->{$syntax} -= 1;
    }
}
# print out the remaining values
for my $rem (values %$data){
    say for @$rem;    
}

对于相同的逻辑，我得到的输出与你的略有不同。你能检查一下这个输出并告诉我是否需要更改吗？在评论中提到了这种方法

use strict;
use warnings;
use feature 'say';
my $syntax = [];
my $NUM = 2;   # change number if needed
my $filename = 'file.txt';
my $data = {};  # make a hash of data

open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    chomp $row;
    next if $. == 1; # skip header row
    my @columns = split (',', $row);

    push @$syntax, $columns[3];   # make a list of all syntaxes available
    push @{$data->{$columns[3]}}, $row;
}
close $fh;

my $processed = {};
# loop throught the syntax array and print data from hash
# also, make a counter of the number of times that syntax is used.
# it will help us to skip next (n-1) occurence of that syntax
for my $syntax (@$syntax) {
    if (!$processed->{$syntax}){
        for my $s (splice @{$data->{$syntax}}, 0, $NUM) {
            $processed->{$syntax} += 1;
            say $s;
        }
    } else {
        $processed->{$syntax} -= 1;
    }
}
# print out the remaining values
for my $rem (values %$data){
    say for @$rem;    
}

我将通过解析文件来收集数据结构中的第一对行，并将其他行发送到临时文件来实现这一点。解析完文件后，将数据结构中的成对行打印到输出文件中，然后将临时文件添加到输出文件的末尾

示例代码：

use strict;
use warnings;
use feature ':5.16';

my $infile = 'infile';
my $outfile = 'outfile';
my $tempfile = 'temp';
my $quantity = 2;  # or whatever

open my $in, '<', $infile or die 'Could not open infile: ' . $!;
open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;

my $hash = {};
my @order;
my $hdr;

while ( <$in> ) {
  if ( $hdr ) {
    my @cols = split ",", $_;
    my $key = $cols[3];

    # have we seen this key before?
    if ( ! $hash->{$key} ) {
      push @order, $key;
      $hash->{$key} = [ $_ ];
    }
    elsif ( scalar @{$hash->{$key}} < $quantity ) {
      push @{$hash->{$key}}, $_;
    }
    else {
      print { $temp } $_;
    }
  }
  else {
    # the header line
    print { $out } $_;
    $hdr = $_;
  }
}

# print the collected twofers out into the tempfile
for my $key ( @order ) {
  print { $out } @{$hash->{$key}};
}
close $out;
close $temp;

# concatenate the files
system join ' ', ( 'cat', $tempfile, '>>', $outfile );

使用严格；
使用警告；
使用特征“：5.16”；
my$infle='infle'；
我的$outfile='outfile'；
我的$tempfile='temp'；
我的$quantity=2；#或者别的什么
打开我的$in，“，$outfile或die”无法创建输出文件：“。$！；
打开我的$temp“>”，$tempfile或die“无法创建tempfile:”。$！；
我的$hash={}；
我的订单；
我的$hdr；
而（）{
如果（$hdr）{
my@cols=split“，”，$；
my$key=$cols[3]；
#我们以前见过这把钥匙吗？
if（！$hash->{$key}）{
按@order，$key；
$hash->{$key}=[$\uuz]；
}
elsif（标量@{$hash->{$key}}<$quantity）{
push@{$hash->{$key}，$\；
}
否则{
打印{$temp}$\；
}
}
否则{
#标题行
打印{$out}$\；
$hdr=$\ux；
}
}
#将收集的两张照片打印到tempfile中
我的$key（@order）{
打印{$out}@{$hash->{$key}；
}
收尾美元；
关闭$temp；
#连接文件
系统联接“”，（'cat'，$tempfile，'>>'，$outfile）；

如果成对的行不必按照它们在源文件中出现的顺序排列，您可以跳过

@order

的内容。

示例代码：

use strict;
use warnings;
use feature ':5.16';

my $infile = 'infile';
my $outfile = 'outfile';
my $tempfile = 'temp';
my $quantity = 2;  # or whatever

open my $in, '<', $infile or die 'Could not open infile: ' . $!;
open my $out, '>', $outfile or die 'Could not create output file: ' . $!;
open my $temp, '>', $tempfile or die 'Could not create tempfile: ' . $!;

my $hash = {};
my @order;
my $hdr;

while ( <$in> ) {
  if ( $hdr ) {
    my @cols = split ",", $_;
    my $key = $cols[3];

    # have we seen this key before?
    if ( ! $hash->{$key} ) {
      push @order, $key;
      $hash->{$key} = [ $_ ];
    }
    elsif ( scalar @{$hash->{$key}} < $quantity ) {
      push @{$hash->{$key}}, $_;
    }
    else {
      print { $temp } $_;
    }
  }
  else {
    # the header line
    print { $out } $_;
    $hdr = $_;
  }
}

# print the collected twofers out into the tempfile
for my $key ( @order ) {
  print { $out } @{$hash->{$key}};
}
close $out;
close $temp;

# concatenate the files
system join ' ', ( 'cat', $tempfile, '>>', $outfile );

使用严格；
使用警告；
使用特征“：5.16”；
my$infle='infle'；
我的$outfile='outfile'；
我的$tempfile='temp'；
我的$quantity=2；#或者别的什么
打开我的$in，“，$outfile或die”无法创建输出文件：“。$！；
打开我的$temp“>”，$tempfile或die“无法创建tempfile:”。$！；
我的$hash={}；
我的订单；
我的$hdr；
而（）{
如果（$hdr）{
my@cols=split“，”，$；
my$key=$cols[3]；
#我们以前见过这把钥匙吗？
if（！$hash->{$key}）{
按@order，$key；
$hash->{$key}=[$\uuz]；
}
elsif（标量@{$hash->{$key}}<$quantity）{
push@{$hash->{$key}，$\；
}
否则{
打印{$temp}$\；
}
}
否则{
#标题行
打印{$out}$\；
$hdr=$\ux；
}
}
#将收集的两张照片打印到tempfile中
我的$key（@order）{
打印{$out}@{$hash->{$key}；
}
收尾美元；
关闭$temp；
#连接文件
系统联接“”，（'cat'，$tempfile，'>>'，$outfile）；

如果成对的行不必按照它们在源文件中的显示顺序，您可以跳过

@order

内容

我有一个近20k行的巨大CSV文件，格式如下：

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

这一点都不算大。文件大小可能约为1兆字节

虽然我通常建议逐行处理以确保文件大小的健壮性，但在这种情况下，您知道您正在处理的文件很小。问题是你花在优化这件事上的时间是否值得

如果我理解正确，您的问题可以通过浪费一些内存快速解决（在程序员时间内）：

#!/usr/bin/env perl

use strict;
use warnings;
use List::Util qw( uniqstr );

my $TOP = 2;

(my $header = <DATA>) =~ s/\s+\z//;
my @header = split /,|\s+/, $header;
my %idx = map +($header[$_] => $_), 0 .. $#header;

my @lines = grep /\S/, <DATA>;
my %syntax_of = map +($_ => (split /,/, $_)[$idx{syntax}]), @lines;

my @syntaxes = uniqstr map $syntax_of{$_}, @lines;

my %lines_of;
for my $n (0 .. $#lines) {
    push @{$lines_of{$syntax_of{$lines[$n]}}}, $n;
}

print "$header\n";

for my $syntax (@syntaxes) {
    my @top = grep defined, map $lines_of{$syntax}->[$_ - 1], 1 .. $TOP;
    print @lines[@top];
    # normally, invoking delete on an array slice is not
    # but it is just what we need here.
    delete @lines[@top];
}

print grep defined, @lines;

__DATA__
file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

#/usr/bin/env perl
严格使用；
使用警告；
使用列表：：Util qw（uniqstr）；
我的$TOP=2；
（my$header=）=~s/\s+\z/；
my@header=split/，|\s+/，$header；
我的%idx=map+（$header[$\u]=>$\u0$#收割台；
我的@lines=grep/\S/；
我的%syntax_of=map+（$\=>（拆分/，/，$）[$idx{syntax}]），@行；
my@syntaxes=uniqstr map$syntax\u of{$\u}，@行；
我的%u行；
对于我的$n（0..$#行）{
推送{$行{$语法{$行[$n]}}}}，$n的{$行}；
}
打印“$header\n”；
对于我的$syntax（@syntax）{
my@top=grep-defined，映射{$syntax}->[$\-1]的$lines\u，1..$top；
打印@行[@top]；
#通常情况下，在数组片上调用delete是不正确的
#但这正是我们在这里需要的。
删除@行[@top]；
}
打印定义的grep，@行；
__资料__
文件、工具、编辑、语法、缓冲区
a、 b，c，perl，d
a、 w，c33，爪哇，d
a、 e，c，perl，d
a、 s，c，python，d1
a、 n，c，php，d3
d、 r，hhh，cpp，d0
d、 m，hhh，c#，d0
a、 o，c，pdf，d3
a、 f，c，python，dd
a、 h，c，perl，dg
a、 yb，c，c，ddf
a、 b，c，perl，dt
西澳大利亚州，b，c33，爪哇州，d
d、 buuu，hhh，cpp，d0
d44，b，hhh，nlp，d0
a、 be，c，js，d4
西澳大利亚州，b，c33，爪哇州，d
wa，b，c33，python，d
wa，b，c33，python，d
wa、b、c33、c#、d
wa，b，c33，python，d
wa、b、c33、php、d
wa，b，c33，python，d
wa、b、c33、php、d
wa，b，c33，python，d
wa，b，c33，perl，d
wa、b、c33、php、d
西澳大利亚州，b，c33，爪哇州，d
wa，b，c33，python，d

附：另见

PPS：乍一看，如果你想花时间在这上面，至少有六件事情可能会在这里有所调整

我有一个近20k行的巨大CSV文件，格式如下：

file,tools,edit,syntax,buffers
a,b,c,perl,d
a,w,c33,java,d
a,e,c,perl,d
a,s,c,python,d1
a,n,c,php,d3
d,r,hhh,cpp,d0
d,m,hhh,c#,d0
a,o,c,pdf,d3
a,f,c,python,dd
a,h,c,perl,dg
a,yb,c,c,ddf
a,b,c,perl,dt
wa,b,c33,java,d
d,buuu,hhh,cpp,d0
d44,b,hhh,nlp,d0
a,be,c,js,d4
wa,b,c33,java,d
wa,b,c33,python,d
wa,b,c33,python,d
wa,b,c33,c#,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,php,d
wa,b,c33,python,d
wa,b,c33,perl,d
wa,b,c33,php,d
wa,b,c33,java,d
wa,b,c33,python,d

这一点都不算大。文件大小可能约为1兆字节