Perl: sort all records by modified date and time
I have some problems with my code. I have 1 GB of records that I have to sort by date and time. Each record is one line of pipe-separated fields and looks like this (fields abbreviated with [...]):
TYP_Journal Article|KEY_1926000001|AED_|TIT_[...]|[...]|AUT_Bryan W.F.|[...] 9/15/2003 3:12:28 PM|MDT_5/16/2017 9:18:40 AM
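I sort these records on the MDT_ field, for example MDT_5/16/2017 9:18:40 AM.
I used the following technique: filter the file into records that have MDT_ and records that do not (one file with MDT_ and one without MDT_).
Code for the MDT data: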
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" || die "file found $!";
my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);
my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
foreach my $changes (@modi_date)
{
chomp($changes);
$Count_pro++;
@ab = grep (/$changes/, @all_data_with_time);
print $doc1 ("@ab\n");
$progress_bar->update($Count_pro);
}
$doc1->close;
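But this process takes a lot of time. Is there any way to do it in less time?

As you pointed out, doing everything in memory is not an option on your machine. However, I do not see why you first sort the dates and then grep all records for each date, rather than sorting the records themselves on that date.

I also suspect that you might save some memory by going through the original file line by line rather than in one huge map-sort-split-map, but I will leave that for you to try. It would also save you the time spent creating the files and then re-parsing them.

I would suggest doing steps 2 and 3 in one go and skipping the construction of @modi_date (which we cannot see :/): read the file of MDT records, parse each date once, and sort the records on the parsed date.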
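As a rough illustration of that suggestion, here is a minimal sketch using only core modules. The names `mdt_key` and `sort_records` are hypothetical and not from the original posts, and the sketch assumes every timestamp has the form `MDT_5/16/2017 9:18:40 AM`. It reads the input once, computes the sort key once per record, and writes dated records followed by undated ones. It still holds all records in memory, so it addresses the running time rather than the memory limit.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: return an ISO 8601 sort key for a record,
# or undef when the record has no parsable MDT_ field.
sub mdt_key {
    my ($record) = @_;
    my ($stamp) = $record =~ m{MDT_(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M)};
    return unless defined $stamp;
    my $t = eval { Time::Piece->strptime($stamp, '%m/%d/%Y %I:%M:%S %p') };
    return unless defined $t;
    return $t->datetime;
}

# Read records line by line from $in and write them to $out:
# dated records first (oldest first, or newest first when
# $newest_first is true), then undated records in input order.
# Returns the number of dated records.
sub sort_records {
    my ($in, $out, $newest_first) = @_;
    my (@dated, @undated);
    while (defined(my $line = <$in>)) {
        my $key = mdt_key($line);
        if (defined $key) { push @dated, [$key, $line] }
        else              { push @undated, $line }
    }
    # Each key was computed once above; the sort only compares strings.
    my @sorted = sort { $a->[0] cmp $b->[0] } @dated;
    @sorted = reverse @sorted if $newest_first;
    print {$out} $_->[1] for @sorted;
    print {$out} @undated;
    return scalar @sorted;
}
```

With two filehandles opened using the `:encoding(UTF-8)` layer, `sort_records($in, $out, 1)` would write the newest records first, which appears to be the order the question is after.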
I finally got it working.
The complete code is:
use warnings;
use strict;
use 5.010;
use Cwd;
binmode STDOUT, ":utf8";
use Date::Simple ('date', 'today');
use Time::Simple;
use Encode;
use Time::Piece;
use Win32::Console::ANSI;
use Term::ANSIScreen qw/:color /;
use File::Copy;
use IO::File;   # needed for the IO::File constructor calls below
BEGIN {our $start_run = time();
my $Start = localtime;
print colored ['bold green'], ("\nstart time :- $Start\n");
}
## variables
my $current_dir = getcwd();
my $current_in = $ARGV[0];
my $current_ou = $ARGV[1];
my @un_ext_file;
my @un_ext_file1;
my $current_data =today();
my $time = Time::Simple->new();
my $hour = $time->hours;
my $minute = $time->minutes;
my $second = $time->seconds;
my $current_time = "$hour"."-"."$minute"."-"."$second";
my $ren_folder = "output_"."$current_data"."_"."$current_time";
##check for output name DIR
opendir(DIR1, $current_ou);
my @current_ou_folder = readdir(DIR1);
closedir(DIR1);
foreach my $entry (@current_ou_folder)
{
if ($entry eq "output")
{
move "$current_ou/output" , "$current_ou/$ren_folder";
mkdir "$current_ou/output";
}
else
{
mkdir "$current_ou/output";
}
}
opendir(DIR, $current_in);
my @files_and_folder = readdir(DIR);
closedir(DIR);
foreach my $entry (@files_and_folder)
{
next if $entry eq '.' or $entry eq '..';
next if -d "$current_in/$entry";   # test the entry inside the input directory, not the cwd
push(@un_ext_file1, $entry);
}
##### check duplicate file name
my %seen;
my @file_test;
foreach my $file_name (@un_ext_file1)
{
if ($file_name =~ /(.*)\.([a-z]+)$/)
{
push (@file_test, $1);
}
else
{
push (@file_test, $file_name);
}
}
foreach my $string (@file_test)
{
next unless $seen{$string}++;
print "'$string' is duplicated.\n";
}
##collect all file from array
foreach my $file_name (@un_ext_file1)
{
my $REC_counter=0;
if ($file_name =~ /(.*)\.([a-z]+)$/) #####work for all extension
{
my $file_name_with_out_ext = $1;
my @modi_date_not_found;
eval{
#####read source file
##### First sort the file by date (oldest date first, newest date last)
##### To get modifiedDate from the file
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open '$current_in/$file_name': $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);
@Dt_ModifiedDate = (); ##### free after use (assigning undef would leave a one-element list)
print colored ['bold green'], ("\n\tAll ModifiedDate data Filtered\n\n");
##### To get un-modifiedDate from the file
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open '$current_in/$file_name': $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);
@un_ModifiedDate = (); ##### free after use (assigning undef would leave a one-element list)
print colored ['bold green'], ("\n\tAll unModifiedDate data Filtered\n\n\n\n");
##### Read ModifiedDate
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.ModifiedDate" or die "cannot open ModifiedDate file: $!";
my @all_ModifiedDate = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
##### Write the ModifiedDate records to the sorted_data file after sorting them.
my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
print $doc1 sort { (toISO8601($a)) cmp (toISO8601($b)) } @all_ModifiedDate;
$doc1->close;
##### Read sorted_data and reverse it, then read the unModifiedDate data and write both to the final file.
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.sorted_data" or die "cannot open sorted_data file: $!";
my @all_sorted_data = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
@all_sorted_data = reverse (@all_sorted_data);
open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.unModifiedDate" or die "cannot open unModifiedDate file: $!";
my @all_unModifiedDate = <read_file_ModifiedDate>;
close(read_file_ModifiedDate);
my $doc_final = new IO::File(">$current_ou/output/$file_name_with_out_ext.txt");   # do not rely on $1 surviving the matches above
$doc_final->binmode(':utf8');
print $doc_final @all_sorted_data;
print $doc_final @all_unModifiedDate;
$doc_final->close;
unlink("$current_ou/output/$file_name_with_out_ext.ModifiedDate");
unlink("$current_ou/output/$file_name_with_out_ext.sorted_data");
unlink("$current_ou/output/$file_name_with_out_ext.unModifiedDate");
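};
warn "could not process '$file_name': $@" if $@;   # report errors caught by eval instead of dropping them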
}
}
#####Process Complete.
say "\n\n---------------------------------------------";
print colored ['bold green'], ("\tProcess Completed\n");
say "---------------------------------------------\n";
get_time();
sub toISO8601
{
my $record = shift;
$record =~ /MDT_([^\|]+)/;
return(Time::Piece->strptime($1, '%m/%d/%Y %I:%M:%S %p')->datetime);
}
sub get_time
{
my $end_run = time();
my $run_time = $end_run - our $start_run;
#my $days = int($sec/(24*60*60));
my $hours = ($run_time/(60*60))%24;
my $mins =($run_time/60)%60;
my $secs = $run_time%60;
print "\nJob took";
print colored ['bold green'], (" $hours:$mins:$secs ");
print "to complete this process\n";
my $End = localtime;
print colored ['bold green'], ("\nEnd time :- $End\n");
}
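The `toISO8601` sub above is what makes the plain `cmp` comparison work: once a US-style 12-hour timestamp is rewritten as ISO 8601, string order and chronological order coincide. The sketch below illustrates this on three sample timestamps (the third is made up for illustration). It also computes each key once before sorting, whereas the `sort` block in the full script calls `toISO8601` twice per comparison. The name `to_iso8601` is a hypothetical variant that takes the bare timestamp rather than the whole record.

```perl
use strict;
use warnings;
use Time::Piece;

# Convert 'M/D/YYYY h:mm:ss AM|PM' into 'YYYY-MM-DDThh:mm:ss'.
sub to_iso8601 {
    my ($stamp) = @_;
    return Time::Piece->strptime($stamp, '%m/%d/%Y %I:%M:%S %p')->datetime;
}

my @stamps = (
    '5/16/2017 9:18:40 AM',
    '9/15/2003 3:12:28 PM',
    '12/1/2016 12:00:05 AM',   # made-up sample to exercise the 12 AM case
);

# A plain string sort compares the leading month digits first,
# so it does not give chronological order.
my @by_string = sort @stamps;

# Decorate each timestamp with its ISO 8601 key, sort on the key,
# then strip the key again. Each timestamp is parsed exactly once.
my @by_time = map  { $_->[1] }
              sort { $a->[0] cmp $b->[0] }
              map  { [ to_iso8601($_), $_ ] } @stamps;

print "$_\n" for @by_time;
```

On a large file the parsing cost is paid once per record this way instead of once per comparison.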
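Comments:

@simbabque: only filtering this question, but not posting any answer, LOL.

Why do you write files instead of simply doing everything you want in one go? That is, push the records that have a date into one array and the others into another array, then sort the first array, write out the result, and be done? Also, I would prefer using DateTime or something similar for sorting by date.

Can you give an example?

You seem to open the file twice, once to write all records with a date to another file and once to write the records without a date to a file. Then you open the first file you wrote, read all records again, sort them, and write them to yet another file. Why not go through all records, put those with a date into one array and those without a date into another, sort the dated ones, and write the result to a file in one go?

@bytepusher If I sort in memory using the technique below, it fails with an "Out of memory" error: my @sorted = map $_->[0], sort { $a->[-2] cmp $b->[-2] } map [ $_, split /\|/ ]

The code from the answer: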
use strict;
use warnings;
use DateTime::Format::Strptime;

my $mdt_fn = 'with_mdt.txt'; # <- whatever name you gave that file?
open ( my $fh, '< :encoding(UTF-8)', $mdt_fn )
or die "could not open file '$mdt_fn' to read: $!";
my $dt_parser = DateTime::Format::Strptime->new(
pattern => '%m/%d/%Y %r',
);
# get all records from file. To ensure we only need to parse the line once,
# store the datetime in a hashref.
my @records;
while ( my $line = <$fh> ){
push @records, {
dt => _dt_from_record($line),
record => $line,
};
}
# If you wanted to CMP rather than doing datetime comparison,
# adapt _dt_from_record and use 'cmp' instead of '<=>'
@records = sort{ $a->{dt} <=> $b->{dt} }@records;
open ( my $out_fh, '> :encoding(UTF-8)', 'sorted.txt') or
die "could not open file to write to: $!";
# Or reverse first if you want latest to oldest
print $out_fh $_->{record} for @records;   # each record still ends with its own newline
close $out_fh;
# I prefer using DateTime for this.
# Using a parser will alert me if some date was set, but cannot be parsed.
# If you want to spare yourself some additional time,
# why not store the parsed date in the file. However, I doubt this takes long.
sub _dt_from_record {
my $record = shift;
$record =~ /MDT_([^\|]+)/;
return $dt_parser->parse_datetime($1);
}
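Since the comment thread reports an "Out of memory" error when all records are sorted in memory, one possible way to reduce the footprint is to keep only a sort key and a byte offset per record, sort that small index, and then seek back into the input to copy each record. The following is a sketch under stated assumptions, not the answer's method: `sort_by_offset` is a hypothetical name, timestamps are assumed to look like `MDT_5/16/2017 9:18:40 AM`, and the files are handled as raw bytes so that `tell` and `seek` offsets stay exact and the UTF-8 content is copied through unchanged.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: sort the records in $in_path by MDT_ timestamp
# (oldest first) into $out_path while holding only [epoch, offset]
# pairs in memory. Records without a parsable MDT_ field are appended
# at the end in their original order. Returns the number of dated records.
sub sort_by_offset {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "cannot read '$in_path': $!";
    open my $out, '>:raw', $out_path or die "cannot write '$out_path': $!";

    my (@index, @undated);
    my $offset = tell $in;
    while (defined(my $line = <$in>)) {
        my ($stamp) = $line =~ m{MDT_(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M)};
        my $t = defined $stamp
            ? eval { Time::Piece->strptime($stamp, '%m/%d/%Y %I:%M:%S %p') }
            : undef;
        if (defined $t) { push @index, [ $t->epoch, $offset ] }
        else            { push @undated, $offset }
        $offset = tell $in;   # start of the next record
    }

    # Sort the small index numerically, then copy records in that order.
    my @order = map { $_->[1] } sort { $a->[0] <=> $b->[0] } @index;
    for my $pos (@order, @undated) {
        seek($in, $pos, 0) or die "seek failed: $!";
        my $record = <$in>;
        print {$out} $record;
    }

    close $out or die "cannot close '$out_path': $!";
    close $in;
    return scalar @index;
}
```

The trade-off is one extra pass of random reads over the input in exchange for memory that grows with the number of records rather than with their total size.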