Perl regex: getting the same output from a regex as from a hash

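I would like to get the same output from a regex as from the hash below. I know my regex is ugly, but I am working on improving it.

So the expected output of the regex is:

20191122181858|0292929|BVFEFZZ9C4|DIZJ4431573.ts|http://pdzr.rt.pl:8889
20191122181907|9836847|40EVFVRFB|DIZJ4432595.ts|http://pdzr.rt.pl:8889
Here is the code: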

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my (%hash); # initialization

if (<DATA>) { # reads and discards the first line of DATA (the header)
        print "here the regex values: \n";
        while (<DATA>) { # read the remaining lines of DATA
                chomp $_; # remove the trailing newline
                my @tab = split(/,/, $_); # split lines
                my ($http, $ts, $macin, $caid) = (@tab[2, 3, 4, 5]);
                my $timestamp = strftime '%Y%m%d%H%M%S', localtime($ts/1000); # from unix epoch time to human read-able date
                my @value = split(/\//, $http); # split values of the http
                my ($url, $filename) = ("http://$value[2]", $value[6]); # value in order to have url and the name of the file
                if (! $hash{$timestamp."|".$caid."|".$macin."|".$filename."|".$url}) { # starting hash in order to avoid duplicates
                        $hash{$timestamp."|".$caid."|".$macin."|".$filename."|".$url} = $timestamp."|".$caid."|".$macin."|".$filename."|".$url;
                }
                my $regex = $_; # trying to have same output with a regex
                $regex =~ s/(?:[^\/]*\/)([^\\*]*\/)([^\.*]*)([^\,*]*)(\,)([^\,*]*)(\,)(.*)(.*)/http:\/$1|$2|$3|$4|$5|$6|$7/;
                print $regex, "\n";
        }
}

if (%hash) { # check that the hash contains values
        print "\nhere the hash values: \n";
        foreach (sort keys %hash) {
                print $_, "\n";
        }
}

__DATA__
"@timestamp",url,ts,macin,caid
"Nov 22, 2019 @ 17:19:07.571","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4432595.ts/9836847.3018322401",1574443147021,40EVFVRFB,9836847
"Nov 22, 2019 @ 17:18:59.264","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4431573.ts/0292929.5002731501",1574443138223,BVFEFZZ9C4,0292929

Output:

here the regex values:
http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4432595.ts/|9836847|.3018322401"|,|1574443147021|,|40EVFVRFB,9836847
http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4431573.ts/|0292929|.5002731501"|,|1574443138223|,|BVFEFZZ9C4,0292929

here the hash values:
20191122181858|0292929|BVFEFZZ9C4|DIZJ4431573.ts|http://pdzr.rt.pl:8889
20191122181907|9836847|40EVFVRFB|DIZJ4432595.ts|http://pdzr.rt.pl:8889

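This regex matches what you want, and the replacement gives the expected result, except for the timestamp, which you have to convert as you did in the first part of your code:

^.+?(http://[^/]+).+/([^/]+?)/[^/]+?,(.+?),(.+?),(.+)
Replacement:
$3|$5|$4|$2|$1

Result: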

1574443147021|9836847|40EVFVRFB|DIZJ4432595.ts|http://pdzr.rt.pl:8889
1574443138223|0292929|BVFEFZZ9C4|DIZJ4431573.ts|http://pdzr.rt.pl:8889

With the timestamp converted, as in the Perl code below, the output is (the header line does not match the pattern, so it is printed unchanged):

"@timestamp",url,ts,macin,caid
20191122181907|9836847|40EVFVRFB|DIZJ4432595.ts|http://pdzr.rt.pl:8889
20191122181858|0292929|BVFEFZZ9C4|DIZJ4431573.ts|http://pdzr.rt.pl:8889

Here is the Perl code:

use strict;
use warnings;
use POSIX qw(strftime);

while (<DATA>) {
    chomp $_;
    s~                          # SUBSTITUTE
        .+?                         # 1 or more any character but newline, not greedy
        (http://[^/]+)              # group 1, URL until the first slash
        .+/                         # 1 or more any character but newline until a slash
        ([^/]+?)                    # group 2, 1 or more non slash
        /[^/]+?,                    # a slash, 1 or more non slash, a comma
        (.+?)                       # group 3, 1 or more any character but newline, not greedy
        ,                           # a comma
        (.+?)                       # group 4, 1 or more any character but newline, not greedy
        ,                           # a comma
        (.+)                        # group 5, 1 or more any character but newline
    ~                           # WITH
        strftime('%Y%m%d%H%M%S',    # convert time
        localtime($3/1000))
        .                           # CONCAT WITH
        "|$5|$4|$2|$1"              # groups 5, 4, 2, 1 joined with pipes
    ~ex;                            # flags: e evaluates the replacement as Perl code, x allows whitespace and comments
    print $_, "\n";
}

__DATA__
"@timestamp",url,ts,macin,caid
"Nov 22, 2019 @ 17:19:07.571","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4432595.ts/9836847.3018322401",1574443147021,40EVFVRFB,9836847
"Nov 22, 2019 @ 17:18:59.264","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4431573.ts/0292929.5002731501",1574443138223,BVFEFZZ9C4,0292929
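
The substitution above prints every line it reads, including the header and any repeated record, whereas the hash in the question keeps each record once and prints the records sorted. The following is a minimal sketch of combining the two, assuming the same input format. The names `format_record`, `@lines` and `%seen` are introduced here for illustration only, and the sample rows are held in an array instead of `__DATA__`.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: build the pipe-separated record for one CSV line.
# Returns nothing when the line is not a data row (for example the header).
sub format_record {
    my ($line) = @_;
    my ($url, $file, $ts, $macin, $caid) = $line =~ m{
        (http://[^/]+)      # scheme, host and port
        .+/                 # intermediate path segments
        ([^/]+)             # file name such as DIZJ4432595.ts
        /[^/,]+,            # last path segment with the closing quote, then a comma
        (\d+),              # ts, epoch time in milliseconds
        ([^,]+),            # macin
        ([^,\s]+)           # caid
        \s*\z
    }x or return;
    my $stamp = strftime('%Y%m%d%H%M%S', localtime(int($ts / 1000)));
    return join '|', $stamp, $caid, $macin, $file, $url;
}

# Sample rows from the question.
my @lines = (
    '"@timestamp",url,ts,macin,caid',
    '"Nov 22, 2019 @ 17:19:07.571","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4432595.ts/9836847.3018322401",1574443147021,40EVFVRFB,9836847',
    '"Nov 22, 2019 @ 17:18:59.264","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4431573.ts/0292929.5002731501",1574443138223,BVFEFZZ9C4,0292929',
);
push @lines, $lines[1];             # repeat one row on purpose to show the de-duplication

my %seen;
for my $line (@lines) {
    my $record = format_record($line);
    next unless defined $record;    # header or malformed line
    $seen{$record} = 1;             # a hash key is stored only once
}
print "$_\n" for sort keys %seen;
```

On a machine whose local time zone is one hour ahead of UTC, this should print the same two lines as the hash version in the question; elsewhere only the hour part of the timestamp differs.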

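Well, there are many ways to achieve the same result. Below is my extended version, which not only shuffles the fields around but also separates them into hashes and performs some operations on them [the timestamp].

From the original post it is not clear whether the timestamp should be taken from the data or generated at run time; I took the timestamp from the data.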

use strict;
use warnings;

use feature qw(say);

use Data::Dumper;

my $debug = 0;

my %row;
my %url;
my @fields  = qw( timestamp url ts macin caid );
my @address = qw( proto dn port dir id );

while( <DATA> ) {
    next if /timestamp/;

    print if $debug;

    chomp;
    s/,//;
    s/"//g;

    @row{@fields} = split ',';

    print Dumper(\%row) if $debug;

    @url{@address} = ( $row{url} =~ m#(\w+)://(.+):(\d+)/(.+)/(.+)# );

    $url{id}    =~ s/\.\d+//;
    $url{dir}   =~ /(\w+\.ts)/;
    $url{ts}    = $1;

    print Dumper(\%url) if $debug;

    say join('|', (
            timestamp($row{timestamp}),
            $url{id},
            $row{macin},
            $url{ts},
            "$url{proto}://$url{dn}:$url{port}"
            ));

}

sub timestamp {
    my $input = shift;

    my %data;
    my $result;

    my %months = ( Jan => 1, Feb => 2, Mar => 3, Apr => 4,
                   May => 5, Jun => 6, Jul => 7, Aug => 8,
                   Sep => 9, Oct => 10, Nov => 11, Dec => 12
                 );

    my @fields = qw( month day year hour min sec msec ); 

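    # split the timestamp text passed in as the argument into its components
    @data{@fields} = $input =~ /(\w+)\s+(\d+)\s+(\d+)\s+@\s+(\d+):(\d+):(\d+)\.(\d+)/;

    print Dumper(\%data) if $debug;

    # six fields: year, month, day, hour, minute, second
    $result = sprintf "%4d%02d%02d%02d%02d%02d",
                    $data{year},
                    $months{$data{month}},
                    $data{day},
                    $data{hour},
                    $data{min},
                    $data{sec};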

    return $result;
}

__DATA__
"@timestamp",url,ts,macin,caid
"Nov 22, 2019 @ 17:19:07.571","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4432595.ts/9836847.3018322401",1574443147021,40EVFVRFB,9836847
"Nov 22, 2019 @ 17:18:59.264","http://pdzr.rt.pl:8889/qsdf/ZDF/vsLop/DIZJ4431573.ts/0292929.5002731501",1574443138223,BVFEFZZ9C4,0292929
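
The two answers take the timestamp from different columns, so their output can differ. The `ts` column holds epoch milliseconds and `localtime` renders it in the machine's time zone, while the `@timestamp` text is used as found. The expected output in the question (18:19:07 for a row stamped 17:19:07) would be consistent with a local zone one hour ahead of the `@timestamp` text, though that is only an inference from the sample data. A small sketch comparing `gmtime` and `localtime` for the first row follows; `format_epoch_ms` is a hypothetical helper, not part of either answer.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format epoch milliseconds as YYYYmmddHHMMSS.
# $use_utc selects gmtime (UTC) instead of localtime (machine time zone).
sub format_epoch_ms {
    my ($ms, $use_utc) = @_;
    my $seconds = int($ms / 1000);               # drop the millisecond part
    my @parts   = $use_utc ? gmtime($seconds) : localtime($seconds);
    return strftime('%Y%m%d%H%M%S', @parts);
}

my $ts = 1574443147021;                          # ts of the first data row
print 'UTC:   ', format_epoch_ms($ts, 1), "\n";  # 20191122171907
print 'local: ', format_epoch_ms($ts, 0), "\n";  # depends on the local time zone
```

If the output has to be reproducible across machines, using `gmtime` (or setting the time zone explicitly) avoids the dependency on the local zone.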

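A regex has no output; it either matches or substitutes. Why switch to a substitution?

Yes, the regex values come from the data. I want to switch to a substitution to clean up the lines, since all lines have the same format. DATA is the source, and I have put the expected output into the hash, so the hash is the reference. I would like the substitution to produce the same result as that reference.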