How can I automatically create a regex pattern based on real data? (Regex, Perl, Pattern Matching)


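I have many database vendors, and each differs in some aspects of its data. I'd like to build data validation rules based on previous data.

For example: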

A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12

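Goal: if the user enters the string "XZ-217" for vendor B, the algorithm should compare it with the previous data and say: this string is not similar to vendor B's previous data.

Are there good ways or tools to make this kind of comparison? The answer could be a generic algorithm or a Perl module.

Edit: I agree that "similarity" is hard to define. But I'd like an algorithm that analyzes roughly 100 previous samples and then compares the result of that analysis with new data. Similarity could be based on length, the use of letters versus digits, the pattern by which the string is built, similar beginnings, endings, or middles, or separators in between.

I feel this is not an easy task, but on the other hand I think it has very wide use. So I hope there are already some hints.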
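The criteria listed in the edit (length, letters versus digits, separators) can be turned into a very rough check. The following is only a minimal sketch under stated assumptions, not an established algorithm: the helper names `shape`, `learn`, and `looks_similar` are hypothetical, letters are limited to ASCII, and "similar" is assumed to mean "same coarse shape and a length inside the previously observed range".

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Reduce a string to a coarse shape: every run of letters becomes "A",
# every run of digits becomes "9", separators are kept unchanged.
sub shape {
    my ($s) = @_;
    $s =~ s/[A-Za-z]+/A/g;
    $s =~ s/[0-9]+/9/g;
    return $s;
}

# Learn the set of shapes and the length range from known-good samples.
sub learn {
    my @samples = @_;
    my %shapes  = map { ( shape($_) => 1 ) } @samples;
    my @lengths = map { length($_) } @samples;
    return { shapes => \%shapes, min_len => min(@lengths), max_len => max(@lengths) };
}

# Assumption: a candidate is "similar" if its shape was seen before and
# its length lies within the observed range.
sub looks_similar {
    my ( $model, $candidate ) = @_;
    my $shape = shape($candidate);
    my $len   = length $candidate;
    my $ok    = $model->{shapes}{$shape}
             && $len >= $model->{min_len}
             && $len <= $model->{max_len};
    return $ok ? 1 : 0;
}

my $vendor_b = learn(qw(1276 1899 22711));
for my $input (qw(XZ-217 1500)) {
    printf "%s: %s\n", $input,
        looks_similar( $vendor_b, $input ) ? 'similar' : 'not similar';
}
# prints:
# XZ-217: not similar
# 1500: similar
```

A strict length range is probably too rigid for real data; allowing some tolerance around the observed minimum and maximum would be a natural refinement.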

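If there were a Tie::StringApproxHash module, it would fit the bill here.

I think you are looking for something that combines the fuzzy-logic functionality of one module with the hash interface of another.

The former is more important; the latter would simplify the coding work.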


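Here is my implementation and a loop over your test cases. Basically, you give the function a list of good values and it tries to build a regex for them.

Output: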

A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})

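Code:

To simplify the work of finding the pattern, optional parts may appear at the end, but no required parts may follow an optional part. This can probably be overcome, but it may be difficult.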
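The general idea described above (a list of good values goes in, a regex comes out, and only trailing parts may be optional) can be sketched as follows. This is a hedged illustration written for this purpose, not the answer's original code: the names `tokenize` and `build_pattern` are hypothetical, letters are assumed to be ASCII, and the sketch simply dies when two samples disagree about the kind of token at the same position.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Split a string into runs of letters, runs of digits, and single other
# characters. Each token is [ regex atom, run length ].
sub tokenize {
    my ($s) = @_;
    my @tokens;
    for my $run ( $s =~ /([A-Za-z]+|[0-9]+|[^A-Za-z0-9])/g ) {
        if    ( $run =~ /^[A-Za-z]/ ) { push @tokens, [ '[A-Za-z]', length $run ] }
        elsif ( $run =~ /^[0-9]/ )    { push @tokens, [ '[0-9]',    length $run ] }
        else                          { push @tokens, [ quotemeta($run), 1 ] }
    }
    return @tokens;
}

# Align the samples token by token from the left and record, per position,
# the atom, the min/max run length, and how many samples reach that position.
sub build_pattern {
    my @samples = @_;
    my @columns;
    for my $sample (@samples) {
        my @tokens = tokenize($sample);
        for my $i ( 0 .. $#tokens ) {
            my ( $atom, $len ) = @{ $tokens[$i] };
            my $col = $columns[$i]
                //= { atom => $atom, min => $len, max => $len, seen => 0 };
            die "Samples disagree at position $i\n" if $col->{atom} ne $atom;
            $col->{min} = min( $col->{min}, $len );
            $col->{max} = max( $col->{max}, $len );
            $col->{seen}++;
        }
    }

    # Build from the end so that a tail missing in some samples becomes one
    # optional group; required parts never follow optional ones.
    my $regex = '';
    for my $i ( reverse 0 .. $#columns ) {
        my $col   = $columns[$i];
        my $prev  = $i ? $columns[ $i - 1 ]{seen} : scalar @samples;
        my $quant = $col->{min} != $col->{max} ? sprintf( '{%d,%d}', $col->{min}, $col->{max} )
                  : $col->{min} == 1           ? ''
                  :                              sprintf( '{%d}', $col->{min} );
        my $piece = $col->{atom} . $quant . $regex;
        $regex = $col->{seen} < $prev ? "(?:$piece)?" : $piece;
    }
    return qr/\A(?:$regex)\z/;
}

my %known = (
    A => [qw(XZ-4 XZ-23 XZ-217)],
    B => [qw(1276 1899 22711)],
    C => [qw(12-4 12-75 12)],
);
for my $vendor ( sort keys %known ) {
    my $pattern = build_pattern( @{ $known{$vendor} } );
    printf "%s: %s (XZ-217 %s)\n", $vendor, $pattern,
        'XZ-217' =~ $pattern ? 'matches' : 'does not match';
}
```

With the three sample sets from the question, only the pattern learned for A accepts "XZ-217". Because the samples are aligned from the left, the sketch shares the limitation noted above: an optional part in the middle of a value is not handled.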

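Joel and I came up with similar ideas. The code below distinguishes 3 types of zones:

one or more non-word characters

alphanumeric clusters

a string of digits

It creates a profile of the string and a regex to match the input. In addition, it contains logic to extend existing profiles. Finally, in the process_input_task sub, it contains some pseudo-logic indicating how this could be integrated into a larger application.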

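use strict;
use warnings;
use List::Util qw<max min>;

# A profile is a list of chunks: [ P => $literal ], [ W => $min, $max ]
# or [ D => $min, $max ]. All subs are written to be called as methods.

# Build one anchored regex that matches any of the given profiles.
sub compile_search_expr { 
    shift;                           # discard the invocant
    @_ = @{ shift() } if @_ == 1;    # accept a single array ref of profiles
    my $str 
        = join( '|'
              , map { join( ''
                           , grep { defined } 
                             map  {
                                 $_->[0] eq 'P' ? quotemeta( $_->[1] )
                               : $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
                               : $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
                               :                  undef
                               ;
                            } @$_ 
                          )
                } @_
        );
    return qr/^(?:$str)$/;
}

# Widen an existing profile with the same structure, or append the new one.
sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) { 
        my $profile_length = @$profile;

        # A different number of chunks means it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) { 
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            if ( $old->[0] eq 'P' ) {
                # Punctuation chunks hold a literal, so they must be identical.
                next PROFILE unless $old->[1] eq $new->[1];
                push @merged, [ @$old ];
                next;
            }
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] ) 
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}

# Split a string into chunks and return its profile as an array ref.
sub compute_info_profile { 
    shift;
    my @profile_chunks
        = map { 
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length($_), length($_) ]
            :        [ D => length($_), length($_) ]
        }
        grep { length $_ } split /(\W+)/, shift
        ;
    return \@profile_chunks;
}

# Pseudo-Perl: the $application methods (including current_customer) and
# the Incident class are placeholders for whatever the application provides.
sub process_input_task { 
    my ( $application, $input ) = @_;

    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );

    if    ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, $application->compute_info_profile( $input ));
    }
    else { 
        # Rejected input is escalated and must not be processed as approved.
        return $application->escalate( 
           Incident->new( issue    => 'INVALID_FORMAT'
                        , input    => $input
                        , customer => $application->current_customer
                        ));
    }

    return $application->process_approved_input( $input );
}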


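This is really vague. Try to define what "similar" means. A computer can't say "eh, looks close enough" unless you give it precise rules. For example, you might want "has more than X characters in common", "starts with the same Y characters", or "has the same symbol (like a dash) in the middle". This is going to be very hard unless you can impose some additional restrictions. Think: how do you prevent your pattern-learning algorithm from deciding to use

qr/.*/

?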
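One possible guard against an over-general pattern, offered here as an assumption rather than something this answer proposes, is to test every learned pattern against values that are known to belong to other vendors. The helper name `false_accepts` is hypothetical, and the negative samples are taken from the question's vendors A and C.

```perl
use strict;
use warnings;

# Return the negative samples that a candidate pattern wrongly accepts.
sub false_accepts {
    my ( $pattern, @negatives ) = @_;
    return grep { $_ =~ $pattern } @negatives;
}

# Values from vendors A and C, which a pattern for vendor B should reject.
my @other_vendors = qw(XZ-4 XZ-23 XZ-217 12-4 12-75 12);

for my $candidate ( qr/^.*$/, qr/^\d{4,5}$/ ) {
    my @leaks = false_accepts( $candidate, @other_vendors );
    printf "%s: %s\n", $candidate,
        @leaks ? "too general, accepts: @leaks" : 'ok';
}
```

The first candidate accepts all six foreign values and would be rejected; the second accepts none of them. This only works when negative examples are available, which is one of the "additional restrictions" the answer asks for.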
