Regex Grep: search for certain characters in any order, with any capitalization


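I need to search for the characters "james". The catch is that they can be in any order, and any of them can be capitalized. For example, the search needs to find the following: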

  • Aemjs
  • 埃马伊斯
  • 塞马吉
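These are just a few possibilities; obviously there are many more combinations.

If possible, this needs to be done with a single grep statement. The grep search is performed through a software package, not on a Unix machine, and the input accepts only one grep command. Can this be expressed in a single statement?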

Here is a fun regex:

/ (?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5} /i
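What this does is use lookaheads to make sure that each character of the string "james" is matched within the next 5 characters, and the i modifier makes the match case-insensitive.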

Put it into grep and you get something like this:

grep -Pi " (?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5} " $file

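where $file is the file to grep through. Note that the -P flag requires GNU grep and indicates that the pattern is a Perl-style regular expression (native grep regexes, even with -E, do not support lookaheads). The -i flag makes the match case-insensitive.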
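
To see the lookahead idea in isolation, here is a minimal Python sketch. It assumes Python's `re` module as a stand-in for a PCRE-style engine, and it drops the surrounding spaces from the pattern so that bare words can be tested; the helper name `find_scrambles` is made up for this illustration.

```python
import re

# The lookahead pattern from the answer, minus the leading and trailing
# spaces (an adaptation for this sketch so that bare words can be tested).
PATTERN = re.compile(
    r"(?=.{0,4}j)(?=.{0,4}a)(?=.{0,4}m)(?=.{0,4}e)(?=.{0,4}s).{5}",
    re.IGNORECASE,
)

def find_scrambles(text):
    """Return the non-overlapping 5-character runs that scramble 'james'."""
    return [m.group(0) for m in PATTERN.finditer(text)]

print(find_scrambles("Aemjs met sEmaj, but not jamea or jame."))
```

Because five different letters must each appear somewhere within a five-character window, every character in the window is forced to be one of them, which is why no explicit list of permutations is needed.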

Let's write a small grep program in TXR Lisp that underlines matches with carets:

#!/usr/local/bin/txr --lisp
(let ((regex (regex-compile (first *args*))))
  (whilet ((line (get-line)))
    (whenlet ((mlist (rra regex line))) ;; rra: regex ranges all
      (put-line line)
      (let ((carets (mkstring (to (find-max mlist)) #\space)))
        (mapdo (op mapdo (do set [carets @1] #\^) (range* (from @1) (to @1)))
               mlist)
        (put-line carets)))))
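Run it with the pattern .....&.*J.*&.*A.*&.*M.*&.*E.*&.*S.* (upper case only; adding lower case is trivial).

The regex simply denotes the set of strings that are five characters long (.....), that contain a J (.*J.*), that likewise contain an A (.*A.*), and so on.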

If a word has repeated letters and all of them must be present, as in DOLLY, it would look like this:

.....&.*D.*&.*O.*&.*L.*L.*&.*Y.*
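The two Ls are covered by .*L.*L.*, which matches the set of strings containing at least two Ls. If we have at least two Ls, at least one D, at least one O and at least one Y, and the length is five characters, then we must have a permutation of DOLLY.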
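
Most engines have no & operator, but the intersection can be emulated by requiring a string to match every operand in full. The Python sketch below is an illustration only: `matches_all` is a hypothetical helper, and it tests whole strings rather than searching inside a line the way the txgrep program does.

```python
import re

def matches_all(patterns, s):
    """Emulate the & (intersection) operator: s must fully match every pattern."""
    return all(re.fullmatch(p, s) for p in patterns)

# Operands of the two conjunctions discussed above.
JAMES = [r".....", r".*J.*", r".*A.*", r".*M.*", r".*E.*", r".*S.*"]
DOLLY = [r".....", r".*D.*", r".*O.*", r".*L.*L.*", r".*Y.*"]

print(matches_all(JAMES, "SEMAJ"))  # a permutation of JAMES
print(matches_all(DOLLY, "LLOYD"))  # a permutation of DOLLY
print(matches_all(DOLLY, "DOLYY"))  # only one L, so the .*L.*L.* operand fails
```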

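Starting with .....&.*J.*&.*A.*&.*M.*&.*E.*&.*S.* can we do some algebra to eliminate the exotic & operator? If we can algebraically reduce it to a manageable ordinary regex that uses only branching, concatenation and so on, then we can use ordinary tools (I mean no silly Perl extensions or anything: just good old NFA regexes).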

A big conjunction immediately suggests De Morgan's law (A&B <--> ~(~A|~B)), which introduces negation. Can we eliminate the negation?

--> ~(~.....|~.*J.*|~.*A.*|~.*M.*|~.*E.*|~.*S.*)
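Aha! First, ~..... denotes "all strings that are not five characters long". That is easy: it is just the set containing the empty string, all one-character strings, all two-character strings and so on up to four characters, no five-character strings, and then all strings of six or more characters. Without ~ we can easily express it as: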

(|.|..|...|....|......+)
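Next, the set denoted by ~.*J.* is simply the set of strings that do not contain J. Easy with a character class: it is just [^J]*! OK, so we can substitute these into the regex, and what we get is one big negation: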

~(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)
Incidentally, let's check that this still works:

$ ./txgrep '~(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)'
JAMES
JAMES
^^^^^
Hey there, JAMES, meet AMSEJAMS.
Hey there, JAMES, meet AMSEJAMS.
           ^^^^^       ^^^^^
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
SJAMSSEMASMSJEMSAMSESAMJESESJASMAS
            ^^^^^  ^^^^^
J
AJAMES
AJAMES
 ^^^^^
Evidently, yes. Phew!
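
The equivalence can also be checked exhaustively on short strings. The Python sketch below is only an illustration: it assumes whole-string matching with `re.fullmatch`, uses a small test alphabet (JAMES plus one extra letter X), and compares the original conjunction with the negated union on every string of up to six characters.

```python
import re
from itertools import product

# Operands of the original conjunction, and the union that is to be negated.
CONJUNCTION = [re.compile(p) for p in
               (r".....", r".*J.*", r".*A.*", r".*M.*", r".*E.*", r".*S.*")]
NEGATED_UNION = re.compile(
    r"(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)")

def by_conjunction(s):
    """True if s fully matches every operand of the & expression."""
    return all(p.fullmatch(s) for p in CONJUNCTION)

def by_negation(s):
    """True if s is NOT in the union, i.e. s is in the ~( ... ) set."""
    return NEGATED_UNION.fullmatch(s) is None

def disagreements(alphabet="JAMESX", max_len=6):
    """List every string on which the two formulations disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if by_conjunction(s) != by_negation(s):
                bad.append(s)
    return bad

print(disagreements())
```

An empty list means the two formulations agree on every string tested; it is a spot check over a small alphabet, not a proof.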

Where can we go from here?

Here is a crazy idea: suppose we allow ourselves to use word anchoring. Can we then use grep -v to find the lines that contain the candidates?

$ grep -v -E '\<(|.|..|...|....|......+|[^J]*|[^A]*|[^M]*|[^E]*|[^S]*)\>'
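Case insensitivity in grep (a POSIX feature) takes care of a large part of the combinatorial explosion. In fact, without -i, all we would have to do is replace every letter such as J with the class [Jj]. That makes our pattern several times larger, but still reasonably manageable.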
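
The letter-to-class substitution is mechanical, as the following Python sketch suggests. `expand_case` is a hypothetical helper, and it is only safe for patterns in which every ASCII letter is a literal (no escapes such as \w and no existing character classes).

```python
import re

def expand_case(pattern):
    """Replace every ASCII letter in the pattern with a class such as [Jj]."""
    return re.sub(
        r"[A-Za-z]",
        lambda m: f"[{m.group().upper()}{m.group().lower()}]",
        pattern,
    )

expanded = expand_case("(AJ|JA)(M(ES|SE)|E(MS|SM)|S(ME|EM))")
print(expanded)
print(bool(re.fullmatch(expanded, "ajMeS")))
```

Each letter grows from one character to four, which is where the "several times larger" estimate comes from.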

Starting from the above, we can apply path compression. For example, several permutations of JAMES, six in fact, are matched by:

J(A(M(ES|SE)|E(MS|SM)|S(ME|EM)))
This is a bit shorter than writing out all six words:

JAMES|JAMSE|JAEMS|JAESM|JASME|JASEM
Now notice that the M(ES|SE)|E(MS|SM)|S(ME|EM) part, which generates the six permutations of the suffix, can be factored out and applied to the prefix JA as well as to AJ:

(AJ|JA)(M(ES|SE)|E(MS|SM)|S(ME|EM))
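Look, we now have 12 matches. That is already 10% of the permutation space. There is a pattern here: we took one particular permutation of the string, namely JAMES, and (arbitrarily) split it into two pieces, JA and MES. Then we permuted the pieces separately as (AJ|JA) (two ways) and M(ES|SE)|E(MS|SM)|S(ME|EM) (six ways); concatenating them gives twelve ways.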

Can't we just repeat this ten times and cover all 120? Of course we can! There are ten ways to choose two letters from a set of five:

JA, JM, JE, JS, AM, AE, AS, ME, MS, ES

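Each of these can be matched in two orders, which gives twenty possibilities. Each of those is combined with the six permutations of the remaining letters: 20*6=120. For each digraph, we just write down a regex following the pattern above, and each one covers 12 of the 120 permutations: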

(AJ|JA)(M(ES|SE)|E(MS|SM)|S(ME|EM))
(JM|MJ)(A(ES|SE)|E(AS|SA)|S(EA|AE))
... eight more
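Join these together with | and we are done. The length is 10*35+9=359 characters.

This is much smaller than the raw uncompressed regex, which is 719 characters long.

(In fact, 719 is 2*359+1.)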
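
The construction and the character counts can be reproduced with a short Python sketch. The function names are invented for this illustration, and the letters inside each digraph come out in a different order than in the hand-written branches above, which does not change the length.

```python
import re
from itertools import combinations, permutations

WORD = "JAMES"

def suffix_group(letters):
    """Build e.g. (M(ES|SE)|E(MS|SM)|S(ME|EM)) for the three letters 'MES'."""
    terms = []
    for x in letters:
        y, z = [c for c in letters if c != x]
        terms.append(f"{x}({y}{z}|{z}{y})")
    return "(" + "|".join(terms) + ")"

def digraph_regex(word=WORD):
    """One branch per 2-letter subset: both orders, then the 6 suffix orders."""
    branches = []
    for a, b in combinations(word, 2):
        rest = "".join(c for c in word if c not in (a, b))
        branches.append(f"({a}{b}|{b}{a})" + suffix_group(rest))
    return "|".join(branches)

compressed = digraph_regex()
all_perms = ["".join(p) for p in permutations(WORD)]
uncompressed = "|".join(all_perms)

print(len(compressed), len(uncompressed))
print(all(re.fullmatch(compressed, p) for p in all_perms))
```

The sketch assumes a word with five distinct letters; repeated letters, as in DOLLY, would need duplicate branches removed.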

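Supplementary approach:

TXR has a trie module that is used by some of its filtering features. What we can do is add all the permutations of JAMES to a trie and then convert that into a trie-compressed regex. However, a function for converting a trie to a regex does not exist. No matter, we can whip one up. The function converts the trie to regex abstract syntax: a tree made of Lisp S-expressions. We can then compile that with regex-compile. As a side effect, the resulting object has a printed representation rendered in regex character syntax (which is what we are ultimately after: to see that form):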

;; This produces S-exp based regex abstract syntax (AST) not
;; regex character syntax.
(defun trie-to-regex (trie)
  (typecase trie
    (null nil)
    (cons ^(compound ,(car trie) ,(trie-to-regex (cdr trie))))
    (hash (iflet ((pairs (hash-pairs trie)))
            (reduce-left (ret ^(or ,@1 ,@2))
                         (mapcar (aret ^(compound ,@1 ,(trie-to-regex @2)))
                                 pairs))))))

$ txr -i trie-to-regex.tl
1> (defvar tr (make-trie))
tr
2> (perm "JAMES")
("JAMES" "JAMSE" "JAEMS" "JAESM" "JASME" "JASEM" "JMAES" "JMASE"
 "JMEAS" "JMESA" "JMSAE" "JMSEA" "JEAMS" "JEASM" "JEMAS" "JEMSA"
 "JESAM" "JESMA" "JSAME" "JSAEM" "JSMAE" "JSMEA" "JSEAM" "JSEMA"
 "AJMES" "AJMSE" "AJEMS" "AJESM" "AJSME" "AJSEM" "AMJES" "AMJSE"
 "AMEJS" "AMESJ" "AMSJE" "AMSEJ" "AEJMS" "AEJSM" "AEMJS" "AEMSJ"
 "AESJM" "AESMJ" "ASJME" "ASJEM" "ASMJE" "ASMEJ" "ASEJM" "ASEMJ"
 "MJAES" "MJASE" "MJEAS" "MJESA" "MJSAE" "MJSEA" "MAJES" "MAJSE"
 "MAEJS" "MAESJ" "MASJE" "MASEJ" "MEJAS" "MEJSA" "MEAJS" "MEASJ"
 "MESJA" "MESAJ" "MSJAE" "MSJEA" "MSAJE" "MSAEJ" "MSEJA" "MSEAJ"
 "EJAMS" "EJASM" "EJMAS" "EJMSA" "EJSAM" "EJSMA" "EAJMS" "EAJSM"
 "EAMJS" "EAMSJ" "EASJM" "EASMJ" "EMJAS" "EMJSA" "EMAJS" "EMASJ"
 "EMSJA" "EMSAJ" "ESJAM" "ESJMA" "ESAJM" "ESAMJ" "ESMJA" "ESMAJ"
 "SJAME" "SJAEM" "SJMAE" "SJMEA" "SJEAM" "SJEMA" "SAJME" "SAJEM"
 "SAMJE" "SAMEJ" "SAEJM" "SAEMJ" "SMJAE" "SMJEA" "SMAJE" "SMAEJ"
 "SMEJA" "SMEAJ" "SEJAM" "SEJMA" "SEAJM" "SEAMJ" "SEMJA" "SEMAJ")
3> (mapdo (op trie-add tr @1 t) (perm "JAMES")) ;; add above to trie
nil
4> (regex-compile (trie-to-regex tr)) ;; compile, get printed rep as side effect
#/A(E(J(MS|SM)|M(JS|SJ)|S(JM|MJ))|J(E(MS|SM)|M(ES|SE)|S(EM|ME))|M(E(JS|SJ)|J(ES|SE)|S(EJ|JE))|S(E(JM|MJ)|J(EM|ME)|M(EJ|JE)))|
E(A(J(MS|SM)|M(JS|SJ)|S(JM|MJ))|J(A(MS|SM)|M(AS|SA)|S(AM|MA))|M(A(JS|SJ)|J(AS|SA)|S(AJ|JA))|S(A(JM|MJ)|J(AM|MA)|M(AJ|JA)))|
J(A(E(MS|SM)|M(ES|SE)|S(EM|ME))|E(A(MS|SM)|M(AS|SA)|S(AM|MA))|M(A(ES|SE)|E(AS|SA)|S(AE|EA))|S(A(EM|ME)|E(AM|MA)|M(AE|EA)))|
M(A(E(JS|SJ)|J(ES|SE)|S(EJ|JE))|E(A(JS|SJ)|J(AS|SA)|S(AJ|JA))|J(A(ES|SE)|E(AS|SA)|S(AE|EA))|S(A(EJ|JE)|E(AJ|JA)|J(AE|EA)))|
S(A(E(JM|MJ)|J(EM|ME)|M(EJ|JE))|E(A(JM|MJ)|J(AM|MA)|M(AJ|JA))|J(A(EM|ME)|E(AM|MA)|M(AE|EA))|M(A(EJ|JE)|E(AJ|JA)|J(AE|EA)))/
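
For readers without TXR, the same trie idea can be sketched in Python. This is not the TXR trie module: it uses plain nested dictionaries and emits regex text directly, and it assumes that no word in the set is a prefix of another (true for permutations of one word).

```python
import re
from itertools import permutations

def build_trie(words):
    """Nested-dict trie: each key is a character, each value the sub-trie."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root

def trie_to_regex(node):
    """Render a trie as regex text so that shared prefixes appear only once."""
    parts = [re.escape(ch) + trie_to_regex(child)
             for ch, child in sorted(node.items())]
    if len(parts) <= 1:
        return "".join(parts)      # leaf or single path: no grouping needed
    return "(" + "|".join(parts) + ")"

perms = ["".join(p) for p in permutations("JAMES")]
pattern = trie_to_regex(build_trie(perms))

print(len(pattern), len("|".join(perms)))
print(all(re.fullmatch(pattern, p) for p in perms))
```

The trie shares only prefixes, never suffixes, so its output is longer than the digraph factoring while still being shorter than the flat list of 120 alternatives.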
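
Another variant of the lookahead approach uses word boundaries and \w rather than fixed offsets:

/\b(?=\w*j)(?=\w*a)(?=\w*m)(?=\w*e)(?=\w*s)\w{5}\b/i

The digraph factoring can also be done with three-letter groups (trigraphs):
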
(defun break-trigraphs (string)
  (if (<= (length string) 3)
    string
    (mapcar (ret (list @1 (break-trigraphs (set-diff string @1))))
            (comb string 3))))

(defun trigraph-tree-to-regex (dtree)
  (typecase dtree
    (str (caseql (length dtree)
           (1 dtree)
           ((2 3) (reduce-right (ret ^(or ,@1 ,@2)) (perm dtree)))
           (t (error "bad trigraph tree"))))
    (cons
      (whenlet ((exprs (collect-each ((elem dtree))
                         ^(compound ,(trigraph-tree-to-regex (first elem))
                                    ,(trigraph-tree-to-regex (second elem))))))
        (reduce-right (ret ^(or ,@1 ,@2)) exprs)))))

$ txr -i trigraphs.tl
1> (break-trigraphs "JAMES")
(("JAM" "ES") ("JAE" "MS") ("JAS" "ME") ("JME" "AS") ("JMS" "AE")
 ("JES" "AM") ("AMS" "JE") ("AES" "JM") ("MES" "JA"))
2> (trigraph-tree-to-regex *1)
(or (compound (or "JAM" (or "JMA" (or "AJM" (or "AMJ" (or "MJA" "MAJ")))))
     (or "ES" "SE"))
  (or (compound (or "JAE" (or "JEA" (or "AJE" (or "AEJ" (or "EJA" "EAJ")))))
       (or "MS" "SM"))
    (or (compound (or "JAS" (or "JSA" (or "AJS" (or "ASJ" (or "SJA" "SAJ")))))
         (or "ME" "EM"))
      (or (compound (or "JME" (or "JEM" (or "MJE" (or "MEJ" (or "EJM" "EMJ")))))
           (or "AS" "SA"))
        (or (compound (or "JMS" (or "JSM" (or "MJS" (or "MSJ" (or "SJM" "SMJ")))))
             (or "AE" "EA"))
          (or (compound (or "JES" (or "JSE" (or "EJS" (or "ESJ" (or "SJE" "SEJ")))))
               (or "AM" "MA"))
            (or (compound (or "AMS" (or "ASM" (or "MAS" (or "MSA" (or "SAM" "SMA")))))
                 (or "JE" "EJ"))
              (or (compound (or "AES" (or "ASE" (or "EAS" (or "ESA" (or "SAE" "SEA")))))
                   (or "JM" "MJ"))
                (compound (or "MES" (or "MSE" (or "EMS" (or "ESM" (or "SME" "SEM")))))
                 (or "JA" "AJ"))))))))))
3> (regex-compile *2)
#/(JAM|JMA|AJM|AMJ|MJA|MAJ)(ES|SE)|(JAE|JEA|AJE|AEJ|EJA|EAJ)(MS|SM)|(JAS|JSA|AJS|ASJ|SJA|SAJ)(ME|EM)|(JME|JEM|MJE|MEJ|EJM|EMJ)(AS|SA)|(JMS|JSM|MJS|MSJ|SJM|SMJ)(AE|EA)|(JES|JSE|EJS|ESJ|SJE|SEJ)(AM|MA)|(AMS|ASM|MAS|MSA|SAM|SMA)(JE|EJ)|(AES|ASE|EAS|ESA|SAE|SEA)(JM|MJ)|(MES|MSE|EMS|ESM|SME|SEM)(JA|AJ)/
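
As a cross-check of the trigraph factoring, here is a Python sketch that builds the same kind of regex with `itertools` and tests it against all 120 permutations. The function names are invented for this illustration and distinct letters are assumed. Note that `itertools.combinations` yields ten three-letter subsets of JAMES, whereas the listing above shows nine pairs (an AME/JS pair does not appear), so the coverage of that listing may be worth re-checking.

```python
import re
from itertools import combinations, permutations

WORD = "JAMES"

def alternation(letters):
    """All orderings of the given letters as one group, e.g. (ES|SE)."""
    return "(" + "|".join("".join(p) for p in permutations(letters)) + ")"

def trigraph_regex(word=WORD):
    """One branch per 3-letter subset: its 6 orders, then the 2 orders of the rest."""
    branches = []
    for tri in combinations(word, 3):
        rest = [c for c in word if c not in tri]
        branches.append(alternation(tri) + alternation(rest))
    return "|".join(branches)

pattern = trigraph_regex()
perms = ["".join(p) for p in permutations(WORD)]

print(pattern.count(")("))  # one ")(" per branch
print(len(pattern))
print(all(re.fullmatch(pattern, p) for p in perms))
```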