How to remove non-unique lines from a large file with Perl?

perl batch-file

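I use Perl to remove duplicate data, invoked through a batch file in a DOS window on Windows. The batch file calls the Perl script that does the work. I have the batch file. My script removes duplicate data as long as the data file is not too large. The problem to solve is with larger data files (2 GB or more): with files of this size, a memory error occurs when the script tries to load the complete file into an array in order to remove duplicates. The memory error occurs in the subroutine at: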

@contents_of_the_file = <INFILE>;

(A completely different approach is acceptable as long as it solves the problem; please suggest one.) The subroutine is:

sub remove_duplicate_data_and_file
{
 open(INFILE,"<" . $output_working_directory . $output_working_filename) or dienice ("Can't open $output_working_filename : INFILE :$!");
  if ($test ne "YES")
   {
    flock(INFILE,1);
   }
  @contents_of_the_file = <INFILE>;
  if ($test ne "YES")
   {
    flock(INFILE,8);
   }
 close (INFILE);
### TEST print "$#contents_of_the_file\n\n";
 @unique_contents_of_the_file= grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

 open(OUTFILE,">" . $output_restore_split_filename) or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
 if ($test ne "YES")
  {
   flock(OUTFILE,1);
  }
for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
  {
   print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
  }
 if ($test ne "YES")
  {
   flock(OUTFILE,8);
  }
}

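Perl handles large files very well, but 2 GB may be a DOS/Windows limit.
How much RAM do you have?
If your OS doesn't complain, it is best to read the file one line at a time and write to the output immediately.
I was thinking of using the diamond operator, but I'm reluctant to recommend any code because the last time I posted code I offended a Perl guru.
I'd rather not risk it. I hope the cavalry arrives soon.
In the meantime, a link.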
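
As a rough sketch of the line-at-a-time idea above (not code from the answer): the hypothetical helper `dedupe_stream` below reads one line, writes it immediately if it has not been seen, and remembers only a 16-byte MD5 digest per distinct line. The sub name and the use of `Digest::MD5` are assumptions made for illustration, and memory use still grows with the number of distinct lines.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: copy $in_file to $out_file, keeping only the first
# occurrence of each line. Returns the number of lines written.
sub dedupe_stream {
    my ($in_file, $out_file) = @_;
    open(my $in,  '<', $in_file)  or die "Can't open $in_file: $!";
    open(my $out, '>', $out_file) or die "Can't open $out_file: $!";

    my %seen;        # binary MD5 digest => number of times the line was seen
    my $kept = 0;
    while (defined(my $line = <$in>)) {
        next if $seen{ md5($line) }++;   # skip second and later occurrences
        print {$out} $line;              # write at once, nothing is buffered in an array
        $kept++;
    }

    close $in;
    close $out or die "Can't close $out_file: $!";
    return $kept;
}
```

It could be called as `dedupe_stream('data.dat', 'data_no_dups.dat')`; both file names are placeholders.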
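
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, only to identify which lines are the same. So:

1. Don't slurp. Read one line at a time.
2. Hash the line.
3. Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
4. If the key already exists, append the duplicate line number to the list corresponding to that value.

At the end of this process you will have a data structure identifying all the duplicate lines. You can then make a second pass over the file to remove those duplicates.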
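
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the sub names `index_lines` and `write_without_duplicates` are hypothetical, and `Digest::MD5` is just one possible choice of hash function.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Pass 1: map each line's digest to the list of line numbers where it occurs.
sub index_lines {
    my ($in_file) = @_;
    my %lines_for;
    open(my $in, '<', $in_file) or die "Can't open $in_file: $!";
    while (defined(my $line = <$in>)) {
        push @{ $lines_for{ md5($line) } }, $.;   # $. is the current line number
    }
    close $in;
    return \%lines_for;
}

# Pass 2: copy the file, skipping every second or later occurrence.
sub write_without_duplicates {
    my ($in_file, $out_file, $lines_for) = @_;

    my %skip;
    for my $nums (values %$lines_for) {
        my @later = @$nums;
        shift @later;            # the first occurrence is kept
        @skip{@later} = ();      # the rest are marked for removal
    }

    open(my $in,  '<', $in_file)  or die "Can't open $in_file: $!";
    open(my $out, '>', $out_file) or die "Can't open $out_file: $!";
    while (defined(my $line = <$in>)) {
        print {$out} $line unless exists $skip{$.};
    }
    close $in;
    close $out or die "Can't close $out_file: $!";
    return scalar keys %skip;    # number of lines removed
}
```

A call such as `write_without_duplicates($file, $new_file, index_lines($file))` ties the two passes together. Storing every line number costs more memory than a plain "seen" flag, but it also tells you where each duplicate was.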
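
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and, if the amount of duplication is low relative to the file size, nearly two more full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; (2) process the file again to write the non-duplicates to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the three-argument form of open(), which you should be using.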
use strict;
use warnings;

use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}    

close $in_handle;
close $out_handle;

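
In the "completely different approach" category, if you have Unix commands available (e.g., via Cygwin):
This should work, with no Perl needed at all, and it may or may not solve your memory problem.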
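Edit: an alternative solution that copes better with large files could use the following algorithm:
1. Read INFILE line by line.
2. Hash each line to a small hash (e.g., hash# mod 10).
3. Append each line to a file unique to that hash number (e.g., tmp-1 to tmp-10).
4. Close INFILE.
5. Open each tmp-# and sort it into a new file sortedtmp-#.
6. Mergesort sortedtmp-[1-10] (i.e., open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the final output file.
For very large files this will be safer than slurping.
Parts 2 and 3 could be changed to use a random number instead of the hash number mod 10.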
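
One way to sketch the bucket idea, with a simplification that should be stated plainly: because identical lines always hash to the same bucket, each temporary file can be deduplicated on its own with a small in-memory hash, so this variant skips the sort and merge steps. It does not preserve the original line order, and it relies on the hash (a random bucket number would not work here). The sub name `dedupe_by_partition`, the temp-file naming, and `Digest::MD5` are assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: split $in_file into $buckets temp files by line digest,
# then deduplicate each temp file separately into $out_file.
# Returns the number of lines written. Output order differs from the input.
sub dedupe_by_partition {
    my ($in_file, $out_file, $buckets) = @_;
    $buckets ||= 10;

    # Open one temp file per bucket (tmp-1 .. tmp-N next to the output file).
    my @tmp_names = map { "$out_file.tmp-$_" } 1 .. $buckets;
    my @tmp_handles;
    for my $name (@tmp_names) {
        open(my $fh, '>', $name) or die "Can't open $name: $!";
        push @tmp_handles, $fh;
    }

    # Route every line to the bucket chosen by its digest.
    open(my $in, '<', $in_file) or die "Can't open $in_file: $!";
    while (defined(my $line = <$in>)) {
        $line .= "\n" unless $line =~ /\n\z/;   # terminate a final partial line
        my $bucket = unpack('N', md5($line)) % $buckets;
        print { $tmp_handles[$bucket] } $line;
    }
    close $in;
    for my $fh (@tmp_handles) {
        close $fh or die "Can't close temp file: $!";
    }

    # Deduplicate bucket by bucket; only one bucket's lines are in memory at a time.
    open(my $out, '>', $out_file) or die "Can't open $out_file: $!";
    my $kept = 0;
    for my $name (@tmp_names) {
        open(my $fh, '<', $name) or die "Can't open $name: $!";
        my %seen;
        while (defined(my $line = <$fh>)) {
            next if $seen{$line}++;
            print {$out} $line;
            $kept++;
        }
        close $fh;
        unlink $name;
    }
    close $out or die "Can't close $out_file: $!";
    return $kept;
}
```

With 10 buckets, each in-memory hash holds roughly a tenth of the distinct lines; raising the bucket count lowers peak memory further, at the cost of more open file handles.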
Here is a script that may help (though I haven't tested it):

You can use Perl's in-place edit mode from the command line:
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename

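Here is a solution that works no matter how big the file is. It does not rely exclusively on RAM, though, so it is slower than a RAM-based solution. You can also specify how much RAM you want it to use.
The solution uses a temporary file that the program treats as a database, via SQLite.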
#!/usr/bin/perl

use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input= shift;
my $temp= 'unique.tmp';
my $cache_size_in_mb= 100;
unlink $temp if -f $temp;
# RaiseError turns any DBI failure into a fatal error instead of a silent one
my $cx= DBI->connect("dbi:SQLite:dbname=$temp", "", "", { RaiseError => 1 });
# a negative cache_size is read by SQLite as KiB; a positive one would be a page count
$cx->do("PRAGMA cache_size = " . (-$cache_size_in_mb * 1024));
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find= $cx->prepare("select line from x where id = ?");
my $list= $cx->prepare("select line from x order by line");
my $insert= $cx->prepare("insert into x (id, line) values(?, ?)");
open(my $file, '<', $input) or die $!;
my ($line_number, $next_line_number, $line, $sha)= 1;
# first pass: record the line number of the first occurrence of each distinct line
$cx->begin_work;   # one transaction for all inserts instead of one commit per row
while($line= <$file>) {
  $line=~ s/\s+$//s;   # trailing whitespace (newline included) is ignored when comparing
  $sha= sha1_base64($line);
  unless($cx->selectrow_array($find, undef, $sha)) {
    $insert->execute($sha, $line_number)}
  $line_number++;
}
$cx->commit;
# second pass: print only the recorded line numbers, in file order
seek $file, 0, 0;
$list->execute;
$line_number= 1;
$next_line_number= $list->fetchrow_array;
while($line= <$file>) {
  $line=~ s/\s+$//s;
  if($next_line_number == $line_number) {
    say $line;
    $next_line_number= $list->fetchrow_array;
    last unless $next_line_number;
  }
  $line_number++;
}
close $file;

Whether or not the OS complains, slurping a 2 GB file is always a bad idea.