Perl: sort all records by modified date and time (perl, activestate)


I have a problem with my code. I have 1 GB of records that I must sort by date and time. A record looks like this:

TYP_Journal Article|KEY_1926000001|AED_|TIT_A purist of the late eighteenth century|...|MDT_5/16/2017 9:18:40 AM

I sort these records by the MDT_ field, e.g. MDT_5/16/2017 9:18:40 AM.

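I used the following technique:

I filter the file into records with and without MDT (creating two files, one with MDT and one without MDT).

Code for the MDT data: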

# 'or' (not '||') so that die applies to open and not to the path string
open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open $current_in/$file_name: $!";
my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
$doc_MD->binmode(':utf8');
print $doc_MD @Dt_ModifiedDate;
$doc_MD->close;
close (read_file);

open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open $current_in/$file_name: $!";
my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
$doc_UMD->binmode(':utf8');
print $doc_UMD @un_ModifiedDate;
$doc_UMD->close;
close (read_file);

Then, following the sorted dates and times, I grep all matching records out of the MDT_ file and finally create the final file:

my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
$doc1->binmode(':utf8');
foreach my $changes (@modi_date)
{
chomp($changes);
$Count_pro++;
@ab = grep (/$changes/, @all_data_with_time);
print $doc1 ("@ab\n");
$progress_bar->update($Count_pro);
}
$doc1->close;

But this process takes too much time. Is there any way to do it in less time?

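As you pointed out, doing everything in memory is not an option on your machine. However, I do not see why you first sort the dates and then grep all records for each date, rather than sorting the records themselves by that date.

I also suspect that going through the original file line by line, rather than in one huge map-sort-split-map, might save you some memory, but I will leave that for you to try. It would spare you creating the files and then re-parsing them.

I would suggest doing steps 2 and 3 in one go:

Skip building @modi_date (which happens somewhere we cannot see :/ ).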

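use DateTime::Format::Strptime;

my $mdt_fn = 'with_mdt.txt'; # <- whatever name you gave that file?
open ( my $fh, '< :encoding(UTF-8)', $mdt_fn )
    or die "could not open file '$mdt_fn' to read: $!";

my $dt_parser = DateTime::Format::Strptime->new(
   pattern => '%m/%d/%Y %r',
);

# get all records from file. To ensure we only need to parse the line once,
# store the datetime in a hashref.
my @records;
while ( my $line = <$fh> ){
    push @records, {
        dt     => _dt_from_record($line),
        record => $line,
    };
}

# If you wanted to CMP rather than doing datetime comparison,
# adapt _dt_from_record and use 'cmp' instead of '<=>'
@records = sort{ $a->{dt} <=> $b->{dt} }@records;

open ( my $out_fh, '> :encoding(UTF-8)', 'sorted.txt') or
    die "could not open file to write to: $!";

# Or reverse first if you want latest to oldest.
# The lines were never chomped, so they still carry their own newline.
print $out_fh $_->{record} for @records;
close $out_fh;

# I prefer using DateTime for this.
# Using a parser will alert me if some date was set, but cannot be parsed.
# If you want to spare yourself some additional time,
# why not store the parsed date in the file. However, I doubt this takes long.

sub _dt_from_record {

    my $record = shift;
    $record =~ /MDT_([^\|]+)/;
    return $dt_parser->parse_datetime($1);

}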
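
The single-pass idea described above (one read of the source file, dated records into one array, undated into another, no intermediate files) can be sketched with core Perl only. This is a minimal sketch, not the answerer's code: the names mdt_sort_key and sort_by_mdt are hypothetical, the MDT_ field is assumed to always look like the sample (M/D/YYYY h:mm:ss AM|PM), and appending the undated records at the end is an assumption about the desired output.

```perl
use strict;
use warnings;

# Hypothetical helper: build a sortable key such as "20170516091840" from a
# record's MDT_ field (assumed format: M/D/YYYY h:mm:ss AM|PM).
# Returns undef when the record has no such field.
sub mdt_sort_key {
    my ($record) = @_;
    my ($mon, $day, $year, $hour, $min, $sec, $ampm) =
        $record =~ m{MDT_(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+) ([AP]M)}i
        or return undef;
    $hour %= 12;                        # 12 AM and 12 PM both become 0 here
    $hour += 12 if uc($ampm) eq 'PM';   # PM hours are then shifted by 12
    return sprintf '%04d%02d%02d%02d%02d%02d',
        $year, $mon, $day, $hour, $min, $sec;
}

# Read records from an open filehandle once, split them into dated and
# undated, and return the dated ones oldest first followed by the undated
# ones in their original order.
sub sort_by_mdt {
    my ($in) = @_;
    my (@dated, @undated);
    while (my $line = <$in>) {
        my $key = mdt_sort_key($line);
        if (defined $key) { push @dated, [ $key, $line ] }
        else              { push @undated, $line }
    }
    # Keys are fixed-width digit strings, so plain string comparison orders
    # them chronologically; each record's date is parsed exactly once.
    my @sorted = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @dated;
    return (@sorted, @undated);
}
```

With $in and $out opened as in the code above, print {$out} sort_by_mdt($in); writes the result in one step; swapping $a and $b in the sort block gives newest first. Note that this still holds every record in memory, so it only helps if the machine can hold the file once.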

I finally did it. The complete code is:

use warnings;
use strict;
use 5.010;
use Cwd;
binmode STDOUT, ":utf8";
use Date::Simple ('date', 'today');
use Time::Simple;
use Encode;
use Time::Piece;
use Win32::Console::ANSI;
use Term::ANSIScreen qw/:color /;
use File::Copy;
use IO::File;   # needed for the IO::File objects created below

BEGIN {our $start_run = time();
    my $Start = localtime;
    print colored ['bold green'], ("\nstart time :- $Start\n");
}
## variables
my $current_dir = getcwd();
my $current_in = $ARGV[0];
my $current_ou = $ARGV[1];
my @un_ext_file;
my @un_ext_file1;
my $current_data =today();
my $time   = Time::Simple->new();
my $hour   = $time->hours;
my $minute = $time->minutes;
my $second = $time->seconds;
my $current_time = "$hour"."-"."$minute"."-"."$second";
my $ren_folder = "output_"."$current_data"."_"."$current_time";

##check for output name DIR
opendir(DIR1, $current_ou);
my @current_ou_folder = readdir(DIR1);
closedir(DIR1);
foreach my $entry (@current_ou_folder)
{
    if ($entry eq "output")
    {
        move "$current_ou/output" , "$current_ou/$ren_folder";
        mkdir "$current_ou/output";
    }
    else
    {
        mkdir "$current_ou/output";
    }
}

opendir(DIR, $current_in);
my @files_and_folder = readdir(DIR);
closedir(DIR);
foreach my $entry (@files_and_folder)
{
    next if $entry eq '.' or $entry eq '..';
    next if -d "$current_in/$entry";   # test the entry inside the input dir, not the cwd
    push(@un_ext_file1, $entry);
}

##### check duplicate file name
my %seen;
my @file_test;
foreach my $file_name (@un_ext_file1)
{
    if ($file_name =~ /(.*)\.([a-z]+)$/)
    {
        push (@file_test, $1);
    }
    else
    {
        push (@file_test, $file_name);
    }
}
foreach my $string (@file_test)
{
    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}

##collect all file from array
foreach my $file_name (@un_ext_file1)
{
    my $REC_counter=0;
    if ($file_name =~ /(.*)\.([a-z]+)$/)               #####work for all extension
    {
        my $file_name_with_out_ext = $1;
        my @modi_date_not_found;
        eval{
        #####read source file

        ##### First sort the file by date (oldest date first, newest date last)
        ##### To get modifiedDate from the file
        open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open $current_in/$file_name: $!";
        my @Dt_ModifiedDate = grep { $_ =~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/i} <read_file>;
        my $doc_MD = new IO::File(">$current_ou/output/$file_name_with_out_ext.ModifiedDate");
        $doc_MD->binmode(':utf8');
        print $doc_MD @Dt_ModifiedDate;
        $doc_MD->close;
        close (read_file);
        undef @Dt_ModifiedDate;  ##### free after use
        print colored ['bold green'], ("\n\tAll ModifiedDate data Filtered\n\n");

        ##### To get un-modifiedDate from the file
        open read_file, '<:encoding(UTF-8)', "$current_in/$file_name" or die "cannot open $current_in/$file_name: $!";
        my @un_ModifiedDate = grep { $_ !~ /MDT_([0-9]+)\/([0-9]+)\/([0-9]+) ([0-9]+):([0-9]+):([0-9]+) ([A-Z]+)/} <read_file>;
        my $doc_UMD = new IO::File(">$current_ou/output/$file_name_with_out_ext.unModifiedDate");
        $doc_UMD->binmode(':utf8');
        print $doc_UMD @un_ModifiedDate;
        $doc_UMD->close;
        close (read_file);
        undef @un_ModifiedDate;  ##### free after use
        print colored ['bold green'], ("\n\tAll unModifiedDate data Filtered\n\n\n\n");

        ##### Read ModifiedDate
        open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.ModifiedDate" or die "cannot open $file_name_with_out_ext.ModifiedDate: $!";
        my @all_ModifiedDate = <read_file_ModifiedDate>;
        close(read_file_ModifiedDate);

        ##### write the ModifiedDate records to the sorted_data file after sorting them.
        my $doc1 = new IO::File(">$current_ou/output/$file_name_with_out_ext.sorted_data");
        $doc1->binmode(':utf8');
        print $doc1 sort { (toISO8601($a)) cmp (toISO8601($b)) } @all_ModifiedDate;
        $doc1->close;

        ##### Read sorted_data and do in reverse order and then read unModifiedDate data and write in final file.
        open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.sorted_data" or die "cannot open $file_name_with_out_ext.sorted_data: $!";
        my @all_sorted_data = <read_file_ModifiedDate>;
        close(read_file_ModifiedDate);
        @all_sorted_data = reverse (@all_sorted_data);

        open read_file_ModifiedDate, '<:encoding(UTF-8)', "$current_ou/output/$file_name_with_out_ext.unModifiedDate" or die "cannot open $file_name_with_out_ext.unModifiedDate: $!";
        my @all_unModifiedDate = <read_file_ModifiedDate>;
        close(read_file_ModifiedDate);

        my $doc_final = new IO::File(">$current_ou/output/$file_name_with_out_ext.txt");
        $doc_final->binmode(':utf8');
        print $doc_final @all_sorted_data;
        print $doc_final @all_unModifiedDate;
        $doc_final->close;

        unlink("$current_ou/output/$file_name_with_out_ext.ModifiedDate");
        unlink("$current_ou/output/$file_name_with_out_ext.sorted_data");
        unlink("$current_ou/output/$file_name_with_out_ext.unModifiedDate");
        }
    }
}

#####Process Complete.
say "\n\n---------------------------------------------";
print colored ['bold green'], ("\tProcess Completed\n");
say "---------------------------------------------\n";

get_time();

sub toISO8601
{
    my $record = shift;
    $record =~ /MDT_([^\|]+)/;
    return(Time::Piece->strptime($1, '%m/%d/%Y %I:%M:%S %p')->datetime);
}

sub get_time
{
    my $end_run = time();
    my $run_time = $end_run - our $start_run;
    #my $days = int($sec/(24*60*60));
    my $hours = ($run_time/(60*60))%24;
    my $mins =($run_time/60)%60;
    my $secs = $run_time%60;

    print "\nJob took";
    print colored ['bold green'], (" $hours:$mins:$secs ");
    print "to complete this process\n";

    my $End = localtime;
    print colored ['bold green'], ("\nEnd time :- $End\n");
}

simbabque only edited the question but did not post any answer, LOL.

Why are you writing files instead of simply doing everything you want in one go? That is, push the records that have a date onto one array and the others onto another, then sort the first array, write out the result, done? Also, I prefer using DateTime or something similar for sorting by date.

Can you give an example?

You seem to open the file twice, once to write all records with a date to another file and once to write the records without a date to a file. Then you open the first file you wrote, read all the records again, sort them, and write them to yet another file. Why not go through all records, put those with a date in one array and those without a date in another, sort the ones with dates, and write the result to a file in one go? (bytepusher)

If I sort in memory using the technique below, I get an "Out of memory" error: my @sorted = map $_->[0], sort { $a->[-2] cmp $b->[-2] } map [ $_, split /\|/ ] ...
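
For the "Out of memory" error reported with the in-memory map/sort/split approach, one option is to keep only a sort key and a byte offset per record in memory and to re-read each record from disk when writing the output. The following is a rough sketch under stated assumptions, not something taken from the thread: the names mdt_key and sort_file_by_offsets are hypothetical, the MDT_ field is assumed to match the sample format, the files are handled as raw bytes (the MDT_ field itself is plain ASCII), and the output order (dated records newest first, then undated records) mirrors what the complete script above produces.

```perl
use strict;
use warnings;

# Hypothetical helper: sortable key such as "20170516091840" for a record's
# MDT_ field (assumed format: M/D/YYYY h:mm:ss AM|PM), or undef if absent.
sub mdt_key {
    my ($record) = @_;
    my ($mon, $day, $year, $hour, $min, $sec, $ampm) =
        $record =~ m{MDT_(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+) ([AP]M)}i
        or return undef;
    $hour %= 12;
    $hour += 12 if uc($ampm) eq 'PM';
    return sprintf '%04d%02d%02d%02d%02d%02d',
        $year, $mon, $day, $hour, $min, $sec;
}

# Copy $in_path to $out_path with dated records newest first, followed by
# the undated records in their original order. Only "key offset" strings
# and plain offsets are kept in memory, never the records themselves.
sub sort_file_by_offsets {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "cannot open '$in_path': $!";
    open my $out, '>:raw', $out_path or die "cannot open '$out_path': $!";

    my (@dated, @undated);
    my $offset = tell $in;                  # byte position of the next record
    while (my $line = <$in>) {
        my $key = mdt_key($line);
        if (defined $key) { push @dated, "$key $offset" }
        else              { push @undated, $offset }
        $offset = tell $in;
    }

    # Keys are 14 characters wide, so the offset starts at index 15.
    my @order = map { substr $_, 15 } sort { $b cmp $a } @dated;
    for my $pos (@order, @undated) {
        seek $in, $pos, 0 or die "cannot seek in '$in_path': $!";
        my $record = <$in>;
        $record .= "\n" unless $record =~ /\n\z/;   # last line may lack a newline
        print {$out} $record;
    }
    close $out or die "cannot close '$out_path': $!";
    close $in;
    return;
}
```

A call such as sort_file_by_offsets("$current_in/$file_name", "$current_ou/output/$file_name_with_out_ext.txt") would replace the three intermediate files. The trade-off is one random seek per record while writing, which is slower on spinning disks than a sequential read but keeps memory use proportional to the number of records rather than to their size.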
