Java: catastrophic backtracking shouldn't be happening on this regex


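Can someone explain why Java's regex engine goes into catastrophic backtracking mode on this regex? As far as I can tell, every alternative is mutually exclusive with every other alternative.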

^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
\"(?:[^\"]+|\"\")+\"|
'(?:[^']+|'')+')
Text:
'pão de açúcar itaucard mastercard platinum SUSTENTABILIDADE])

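Adding possessive quantifiers to some of the alternatives fixes the problem, but I don't know why. Java's regex library seems to have serious trouble when it backtracks over mutually exclusive branches.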

 ^(?:[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}][^\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]*|
 \"(?:[^\"]++|\"\")++\"|
 '(?:[^']++|'')++')
Can someone explain why Java's regex engine goes into catastrophic mode on this regex?

For the string:

'pão de açúcar itaucard mastercard platinum SUSTENTABILIDADE])
It seems that this part of the regex is the problem:

'(?:[^']+|'')+'
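It matches the first `'`, then fails to find the closing `'`, and therefore backtracks through every combination of the nested quantifiers.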

If you allow a regex to backtrack, it will backtrack (on failure). Use atomic groups and/or possessive quantifiers to prevent that.
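
As a rough illustration of both remedies, here is a minimal Java sketch. The class name `QuotedStringDemo`, the helper `leadingQuoted`, and the leading `^` anchor are assumptions made for this example. The non-ASCII letters are written as `\u` escapes only so that the source compiles under any file encoding. The unprotected original is deliberately not run on the long input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedStringDemo {
    // Possessive quantifiers: neither loop gives back what it has matched.
    static final Pattern POSSESSIVE = Pattern.compile("^'(?:[^']++|'')++'");

    // Atomic group around the whole loop: once the group has been left,
    // its saved backtracking positions are thrown away.
    static final Pattern ATOMIC = Pattern.compile("^'(?>(?:[^']+|'')+)'");

    /** Returns the quoted string at the start of input, or null if there is none. */
    static String leadingQuoted(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The 61-character string from the answers below (no closing quote).
        String bad = "'p\u00e3o de a\u00e7\u00facar itaucard mastercard platinum SUSTENTABILIDAD])";
        String good = "'it''s quoted' and more";

        System.out.println(leadingQuoted(POSSESSIVE, bad));  // null
        System.out.println(leadingQuoted(ATOMIC, bad));      // null
        System.out.println(leadingQuoted(POSSESSIVE, good)); // 'it''s quoted'
        System.out.println(leadingQuoted(ATOMIC, good));     // 'it''s quoted'
    }
}
```

Both variants reject the unterminated input after a single pass over it, because the characters consumed by `[^']+` are never handed back for a different split.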


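By the way, you don't need most of the escapes in that regex. Inside a character class (`[]`), the only characters you may need to escape are `^`, `-`, and `]` (plus `[` in Java, where it opens a nested class), and you can usually position them so that they need no escaping either. Of course, `\` and whatever character delimits your string still need to be (doubly) escaped.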
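
One way to convince yourself of this is to compare the over-escaped class with a trimmed one, character by character. The sketch below is only a check under stated assumptions: `EscapeCheck` and `firstDifference` are names invented for the example, and the patterns are compiled without `Pattern.COMMENTS`, so `#` needs no escape here.

```java
import java.util.regex.Pattern;

final class EscapeCheck {
    // The first character class exactly as written in the question.
    static final Pattern ESCAPED =
        Pattern.compile("[^'\"\\s~:/@#\\|\\^\\&\\[\\]\\(\\)\\\\\\{\\}]");

    // The same class keeping only the escapes java.util.regex needs here:
    // the backslash itself and the two square brackets.
    static final Pattern TRIMMED =
        Pattern.compile("[^'\"\\s~:/@#|^&\\[\\]()\\\\{}]");

    /** Returns the first code point below limit on which the classes disagree, or -1. */
    static int firstDifference(int limit) {
        for (int cp = 0; cp < limit; cp++) {
            String s = new String(Character.toChars(cp));
            if (ESCAPED.matcher(s).matches() != TRIMMED.matcher(s).matches()) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // -1 means the two classes accept exactly the same characters.
        System.out.println(firstDifference(0x10000));
    }
}
```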


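I have to admit this surprised me too, but I get the same result in RegexBuddy: it gives up after a million steps. I know the warnings about catastrophic backtracking tend to focus on nested quantifiers, but in my experience alternation is at least as dangerous. In fact, if I change the last part of your regex from this: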

'(?:[^']+|'')+'
...to this:

'(?:[^']*(?:''[^']*)*)'
...it fails in only eleven steps. This is an example of Friedl's "unrolled loop" technique, which he breaks down like this:

opening normal * ( special normal * ) * closing
   '     [^']        ''     [^']           '
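Nested stars are safe as long as:

  • `special` and `normal` can never match the same thing,
  • `special` always matches at least one character, and
  • `special` is atomic (there must be only one way for it to match).

The regex will then fail with minimal backtracking when no match is possible, and succeed with no backtracking at all. The alternation version, on the other hand, is almost guaranteed to backtrack, and where no match is possible it quickly gets out of hand as the target string grows. If it doesn't backtrack excessively in some flavors, that is because they have optimizations built in specifically to counter this problem, and so far very few flavors do.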
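
The same template can be assembled mechanically in Java. This is a sketch rather than anything from the answer itself: `UnrolledLoop`, `unroll`, and `leadingMatch` are hypothetical names, the `^` anchor is added for the example, and the caller is assumed to pass regex fragments that satisfy the three conditions above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnrolledLoop {
    /** Builds: opening normal* ( special normal* )* closing, anchored at the start. */
    static Pattern unroll(String opening, String normal, String special, String closing) {
        return Pattern.compile(
            "^" + opening + normal + "*(?:" + special + normal + "*)*" + closing);
    }

    // opening = '   normal = [^']   special = ''   closing = '
    static final Pattern SINGLE_QUOTED = unroll("'", "[^']", "''", "'");

    /** Returns the text matched at the start of input, or null if nothing matches. */
    static String leadingMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String bad = "'p\u00e3o de a\u00e7\u00facar itaucard mastercard platinum SUSTENTABILIDAD])";
        System.out.println(leadingMatch(SINGLE_QUOTED, bad));        // null
        System.out.println(leadingMatch(SINGLE_QUOTED, "'it''s' x")); // 'it''s'
        System.out.println(leadingMatch(SINGLE_QUOTED, "'' x"));      // ''
    }
}
```

Note one behavioural difference from the alternation form: because `normal*` may match nothing, the unrolled pattern also accepts the empty quoted string `''`, which `'(?:[^']+|'')+'` would read as an escaped quote with no closing delimiter.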

EDIT: Added a Java version at the end, clumsy, unreadable, and unmaintainable though it inherently is.


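No More Ugly Patterns!

The first thing you need to do is write your regex in a way that stands some hope of being readable by human beings, and consequently maintainable. The second thing you need to do is profile it to see what it is actually doing.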

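That means you need, at a minimum, to compile it in `Pattern.COMMENTS` mode (or prefix it with `"(?x)"`) and then add whitespace to give it some visual elbow room. As far as I can make out, the pattern you are actually trying to match is this: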

    ^ 
    (?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted
        [^"\s~:/@\#|^&\[\]()\\{}] *
      | " (?: [^"]+ | "" )+ "        # alt 2: double quoted
      | ' (?: [^']+ | '' )+ '        # alt 3: single quoted
    )
    
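As you can see, I have introduced both vertical and horizontal whitespace wherever possible, as a kind of cognitive chunking to guide the eye and the mind. I have also removed all your extraneous backslashes. Those are either outright errors or else obfuscators that do nothing but confuse the reader.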

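Notice how, when applying vertical whitespace, I have put the parts that are the same from one line to the next in the same column, so that you can immediately see which parts are the same and which are different.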

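Having done that, I can finally see that what you have here is a match anchored to the start, followed by a choice of three alternatives. I have therefore labelled those three alternatives with descriptive comments so that nobody has to guess.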

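I also notice that your first alternative has two subtly different (negated) bracketed character classes. The second one lacks the single-quote exclusion seen in the first. Is that intentional? Even if it is, I find this too much duplication for my taste; some or all of it should be in a variable so that you don't risk update-coherency issues.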
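
A sketch of what that might look like in Java follows. The names `ReadablePattern`, `EXCLUDED`, `TOKEN`, and `leadingToken` are made up for the illustration, and the possessive quantifiers in the quoted alternatives are an assumption borrowed from the question's own fix. The `#` is escaped because `Pattern.COMMENTS` would otherwise treat it as the start of a comment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReadablePattern {
    // Characters that may not appear in an unquoted token, written exactly once.
    static final String EXCLUDED = "\"\\s~:/@\\#|^&\\[\\]()\\\\{}";

    static final Pattern TOKEN = Pattern.compile(
          "^ (?: [^'" + EXCLUDED + "]    # alt 1: unquoted (first char also excludes ')\n"
        + "      [^" + EXCLUDED + "] *\n"
        + "    | \" (?: [^\"]++ | \"\" )++ \"    # alt 2: double quoted\n"
        + "    | ' (?: [^']++ | '' )++ '        # alt 3: single quoted\n"
        + "  )",
        Pattern.COMMENTS);

    /** Returns the token at the start of input, or null if there is none. */
    static String leadingToken(String input) {
        Matcher m = TOKEN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingToken("abc def"));          // abc
        System.out.println(leadingToken("\"a\"\"b\" rest"));  // "a""b"
        System.out.println(leadingToken("'it''s' x"));        // 'it''s'
    }
}
```

With the shared part in one constant, the only difference between the two classes of the first alternative (the extra `'`) is visible at a glance.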


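Profiling

The second and more important of the two things you must do is to profile this. You need to see exactly what regex program the pattern is compiled into, and you need to trace its execution as it runs over your data.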

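Java's `Pattern` class is not currently capable of this, although I have spoken at some length with the current code custodian at OraSun about it; he is keen to add this capability to Java and thinks he knows exactly how to do so. He even sent me a prototype that does the first part: the compilation. So I expect it to become available one of these days.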

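Meanwhile, let's turn instead to a tool in which regexes are an integral part of the programming language proper, not something bolted on as an awkward afterthought. Although several languages meet that criterion, in none has the art of pattern matching reached the level of sophistication seen in Perl.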

Here is an equivalent program.

    #!/usr/bin/env perl
    use v5.10;      # first release with possessive matches
    use utf8;       # we have utf8 literals
    use strict;     # require variable declarations, etc
    use warnings;   # check for boneheadedness
    
    my $match = qr{
        ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}]
              [^"\s~:/@\#|^&\[\]()\\{}] *
            | " (?: [^"]+ | "" )+ "
            | ' (?: [^']+ | '' )+ '
        )
    }x;
    
    my $text = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";
    
    my $count = 0;
    
    while ($text =~ /$match/g) {
        print "Got it: $&\n";
        $count++;
    }
    
    if ($count == 0) {
        print "Match failed.\n";
    }
    
If we run that program, we get the expected answer that the match failed. The question is why and how.

We now want to look at two things: what regex program the pattern compiles into, and how the execution of that regex program proceeds.

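Both of these are controlled by the

    use re "debug";

pragma, which can also be specified on the command line as `-Mre=debug`. That is what we'll do here, first with `-c` so that the pattern is compiled but the program is not run:
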
    $ perl -c -Mre=debug /tmp/bt
    Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
    Final program:
       1: BOL (2)
       2: BRANCH (26)
       3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
      14:   STAR (79)
      15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
      26: BRANCH (FAIL)
      27:   TRIE-EXACT["'] (79)
            <"> (29)
      29:     CURLYX[0] {1,32767} (49)
      31:       BRANCH (44)
      32:         PLUS (48)
      33:           ANYOF[\x00-!#-\xff][{unicode_all}] (0)
      44:       BRANCH (FAIL)
      45:         EXACT <""> (48)
      47:       TAIL (48)
      48:     WHILEM[1/2] (0)
      49:     NOTHING (50)
      50:     EXACT <"> (79)
            <'> (55)
      55:     CURLYX[0] {1,32767} (75)
      57:       BRANCH (70)
      58:         PLUS (74)
      59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
      70:       BRANCH (FAIL)
      71:         EXACT <''> (74)
      73:       TAIL (74)
      74:     WHILEM[2/2] (0)
      75:     NOTHING (76)
      76:     EXACT <'> (79)
      78: TAIL (79)
      79: END (0)
    anchored(BOL) minlen 1 
    /tmp/bt syntax OK
    Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
    
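Running the match for real against the full 61-character string produces a great deal of trace output:

    $ perl -Mre=debug /tmp/bt 2>&1 | wc -l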
    9987
    
With the string cut down to its first four characters, the trace is far shorter:

    $ perl -Mre=debug /tmp/bt 2>&1 | wc -l
    167
    
Lines of trace output as a function of input length:

    chars   lines   string
        1     63   ‹'›
        2     78   ‹'p›  
        3    109   ‹'pã›
        4    167   ‹'pão› 
        5    290   ‹'pão ›
        6    389   ‹'pão d›
        7    487   ‹'pão de›
        8    546   ‹'pão de ›
        9    615   ‹'pão de a›
       10    722   ‹'pão de aç›
      ....
       61   9987   ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])›
    
Here is the complete trace for the four-character string:

    $ perl -Mre=debug /tmp/bt
    Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o"
    UTF-8 string...
       0 <> <'p%x{e3}o>  |  1:BOL(2)
       0 <> <'p%x{e3}o>  |  2:BRANCH(26)
       0 <> <'p%x{e3}o>  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                                failed...
       0 <> <'p%x{e3}o>  | 26:BRANCH(78)
       0 <> <'p%x{e3}o>  | 27:  TRIE-EXACT["'](79)
       0 <> <'p%x{e3}o>  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
       1 <'> <p%x{e3}o>  |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                                got 1 possible matches
                                TRIE matched word #2, continuing
                                only one match left, short-circuiting: #2 <'>
       1 <'> <p%x{e3}o>  | 55:  CURLYX[0] {1,32767}(75)
       1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)
                                  whilem: matched 0 out of 1..32767
       1 <'> <p%x{e3}o>  | 57:      BRANCH(70)
       1 <'> <p%x{e3}o>  | 58:        PLUS(74)
                                      ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
       5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
       5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                            ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                            failed...
       5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
       5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                            failed...
                                          BRANCH failed...
                                        whilem: failed, trying continuation...
       5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
       5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                          failed...
                                        failed...
       4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
       4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
       4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                            ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                              whilem: matched 2 out of 1..32767
       5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
       5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                                  failed...
       5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
       5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                                  failed...
                                                BRANCH failed...
                                              whilem: failed, trying continuation...
       5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
       5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                                failed...
                                              failed...
                                            failed...
       4 <'p%x{e3}> <o>  | 70:            BRANCH(73)
       4 <'p%x{e3}> <o>  | 71:              EXACT <''>(74)
                                            failed...
                                          BRANCH failed...
                                        whilem: failed, trying continuation...
       4 <'p%x{e3}> <o>  | 75:            NOTHING(76)
       4 <'p%x{e3}> <o>  | 76:            EXACT <'>(79)
                                          failed...
                                        failed...
       2 <'p> <%x{e3}o>  | 74:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
       2 <'p> <%x{e3}o>  | 57:            BRANCH(70)
       2 <'p> <%x{e3}o>  | 58:              PLUS(74)
                                            ANYOF[\x00-&(-\xff][{unicode_all}] can match 2 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                              whilem: matched 2 out of 1..32767
       5 <'p%x{e3}o> <>  | 57:                  BRANCH(70)
       5 <'p%x{e3}o> <>  | 58:                    PLUS(74)
                                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                                  failed...
       5 <'p%x{e3}o> <>  | 70:                  BRANCH(73)
       5 <'p%x{e3}o> <>  | 71:                    EXACT <''>(74)
                                                  failed...
                                                BRANCH failed...
                                              whilem: failed, trying continuation...
       5 <'p%x{e3}o> <>  | 75:                  NOTHING(76)
       5 <'p%x{e3}o> <>  | 76:                  EXACT <'>(79)
                                                failed...
                                              failed...
       4 <'p%x{e3}> <o>  | 74:                WHILEM[2/2](0)
                                              whilem: matched 2 out of 1..32767
       4 <'p%x{e3}> <o>  | 57:                  BRANCH(70)
       4 <'p%x{e3}> <o>  | 58:                    PLUS(74)
                                                  ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:                      WHILEM[2/2](0)
                                                    whilem: matched 3 out of 1..32767
       5 <'p%x{e3}o> <>  | 57:                        BRANCH(70)
       5 <'p%x{e3}o> <>  | 58:                          PLUS(74)
                                                        ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                                        failed...
       5 <'p%x{e3}o> <>  | 70:                        BRANCH(73)
       5 <'p%x{e3}o> <>  | 71:                          EXACT <''>(74)
                                                        failed...
                                                      BRANCH failed...
                                                    whilem: failed, trying continuation...
       5 <'p%x{e3}o> <>  | 75:                        NOTHING(76)
       5 <'p%x{e3}o> <>  | 76:                        EXACT <'>(79)
                                                      failed...
                                                    failed...
                                                  failed...
       4 <'p%x{e3}> <o>  | 70:                  BRANCH(73)
       4 <'p%x{e3}> <o>  | 71:                    EXACT <''>(74)
                                                  failed...
                                                BRANCH failed...
                                              whilem: failed, trying continuation...
       4 <'p%x{e3}> <o>  | 75:                  NOTHING(76)
       4 <'p%x{e3}> <o>  | 76:                  EXACT <'>(79)
                                                failed...
                                              failed...
                                            failed...
       2 <'p> <%x{e3}o>  | 70:            BRANCH(73)
       2 <'p> <%x{e3}o>  | 71:              EXACT <''>(74)
                                            failed...
                                          BRANCH failed...
                                        whilem: failed, trying continuation...
       2 <'p> <%x{e3}o>  | 75:            NOTHING(76)
       2 <'p> <%x{e3}o>  | 76:            EXACT <'>(79)
                                          failed...
                                        failed...
                                      failed...
       1 <'> <p%x{e3}o>  | 70:      BRANCH(73)
       1 <'> <p%x{e3}o>  | 71:        EXACT <''>(74)
                                      failed...
                                    BRANCH failed...
                                  failed...
                                failed...
                              BRANCH failed...
    Match failed
    Match failed.
    Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
    
The trouble is in the third alternative:

      | ' (?: [^']+ | '' )+ '        # alt 3: single quoted

which compiles to this part of the regex program:
    
            <'> (55)
      55:     CURLYX[0] {1,32767} (75)
      57:       BRANCH (70)
      58:         PLUS (74)
      59:           ANYOF[\x00-&(-\xff][{unicode_all}] (0)
      70:       BRANCH (FAIL)
      71:         EXACT <''> (74)
    
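In the trace, each time the closing quote fails to match, the PLUS gives back a character and WHILEM starts another iteration on what is left:

       1 <'> <p%x{e3}o>  | 74:    WHILEM[2/2](0)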
                                  whilem: matched 0 out of 1..32767
       1 <'> <p%x{e3}o>  | 57:      BRANCH(70)
       1 <'> <p%x{e3}o>  | 58:        PLUS(74)
                                      ANYOF[\x00-&(-\xff][{unicode_all}] can match 3 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
       5 <'p%x{e3}o> <>  | 57:            BRANCH(70)
       5 <'p%x{e3}o> <>  | 58:              PLUS(74)
                                            ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                            failed...
       5 <'p%x{e3}o> <>  | 70:            BRANCH(73)
       5 <'p%x{e3}o> <>  | 71:              EXACT <''>(74)
                                            failed...
                                          BRANCH failed...
                                        whilem: failed, trying continuation...
       5 <'p%x{e3}o> <>  | 75:            NOTHING(76)
       5 <'p%x{e3}o> <>  | 76:            EXACT <'>(79)
                                          failed...
                                        failed...
       4 <'p%x{e3}> <o>  | 74:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
       4 <'p%x{e3}> <o>  | 57:            BRANCH(70)
       4 <'p%x{e3}> <o>  | 58:              PLUS(74)
                                            ANYOF[\x00-&(-\xff][{unicode_all}] can match 1 times out of 2147483647...
       5 <'p%x{e3}o> <>  | 74:                WHILEM[2/2](0)
                                              whilem: matched 2 out of 1..32767
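
To put a number on what the trace is doing, here is a toy Java model of a naive backtracking matcher working through `'(?:[^']+|'')+'` on input that has no closing quote. It is a hypothetical simplification (`BacktrackModel` and `closingQuoteAttempts` are invented names), not a description of how Perl's or Java's engine is actually implemented. It only counts how many times the closing quote would be tested.

```java
final class BacktrackModel {
    /**
     * Counts closing-quote tests for '(?:[^']+|'')+' when `remaining`
     * non-quote characters follow the opening quote and no closing quote exists.
     */
    static long closingQuoteAttempts(int remaining) {
        long attempts = 0;
        // The inner [^']+ first takes everything, then gives back one character at a time.
        for (int taken = remaining; taken >= 1; taken--) {
            // The outer + first tries another iteration on whatever is left over...
            attempts += closingQuoteAttempts(remaining - taken);
            // ...and only then tests the closing quote, which fails.
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " chars -> " + closingQuoteAttempts(n) + " attempts");
        }
    }
}
```

For three characters the model gives 7, which agrees with the seven `EXACT <'>` attempts at node 76 in the full trace for `'pão`. In general it yields 2^n - 1, so every extra character doubles the work. Real engines may prune some of this: the Perl line counts in the table above grow far more slowly than the model does.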
    
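If doubled quotes inside the quoted string do not actually need to be supported, the nested loop

    ' (?: [^']+ | '' )+ '

can be flattened to

    ' [^']* '

which gives this simplified pattern: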
    
    ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +
        | " [^"]* "
        | ' [^']* '
      )
    
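With no nested loop left, the compiled program is shorter, and the match fails after a single attempt at the STAR:

    Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...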
    Final program:
       1: BOL (2)
       2: BRANCH (26)
       3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
      14:   STAR (61)
      15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
      26: BRANCH (FAIL)
      27:   TRIE-EXACT["'] (61)
            <"> (29)
      29:     STAR (41)
      30:       ANYOF[\x00-!#-\xff][{unicode_all}] (0)
      41:     EXACT <"> (61)
            <'> (46)
      46:     STAR (58)
      47:       ANYOF[\x00-&(-\xff][{unicode_all}] (0)
      58:     EXACT <'> (61)
      60: TAIL (61)
      61: END (0)
    anchored(BOL) minlen 1 
    Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mastercard platinum S"...
    UTF-8 string...
       0 <> <'p%x{e3}o >  |  1:BOL(2)
       0 <> <'p%x{e3}o >  |  2:BRANCH(26)
       0 <> <'p%x{e3}o >  |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                                 failed...
       0 <> <'p%x{e3}o >  | 26:BRANCH(60)
       0 <> <'p%x{e3}o >  | 27:  TRIE-EXACT["'](61)
       0 <> <'p%x{e3}o >  |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
       1 <'> <p%x{e3}o d> |      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                                 got 1 possible matches
                                 TRIE matched word #2, continuing
                                 only one match left, short-circuiting: #2 <'>
       1 <'> <p%x{e3}o d> | 46:  STAR(58)
                                 ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
                                 failed...
                               BRANCH failed...
    Match failed
    Match failed.
    Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
    
Alternatively, keep the original structure and make the quantifiers possessive:

       ^ (?: [^'"\s~:/@\#|^&\[\]()\\{}] +    # alt 1: unquoted
           | " (?: [^"]++ | "" )++ "        # alt 2: double quoted
           | ' (?: [^']++ | '' )++ '        # alt 3: single quoted
         )
    
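The possessive quantifiers compile to SUSPEND nodes, and the match again fails without retrying any shorter split:

    Compiling REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...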
    Final program:
       1: BOL (2)
       2: BRANCH (26)
       3:   ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (14)
      14:   STAR (95)
      15:     ANYOF[^\x09\x0a\x0c\x0d "#&()/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl] (0)
      26: BRANCH (FAIL)
      27:   TRIE-EXACT["'] (95)
            <"> (29)
      29:     SUSPEND (58)
      31:       CURLYX[0] {1,32767} (55)
      33:         BRANCH (50)
      34:           SUSPEND (54)
      36:             PLUS (48)
      37:               ANYOF[\x00-!#-\xff][{unicode_all}] (0)
      48:             SUCCEED (0)
      49:           TAIL (53)
      50:         BRANCH (FAIL)
      51:           EXACT <""> (54)
      53:         TAIL (54)
      54:       WHILEM[1/2] (0)
      55:       NOTHING (56)
      56:       SUCCEED (0)
      57:     TAIL (58)
      58:     EXACT <"> (95)
            <'> (63)
      63:     SUSPEND (92)
      65:       CURLYX[0] {1,32767} (89)
      67:         BRANCH (84)
      68:           SUSPEND (88)
      70:             PLUS (82)
      71:               ANYOF[\x00-&(-\xff][{unicode_all}] (0)
      82:             SUCCEED (0)
      83:           TAIL (87)
      84:         BRANCH (FAIL)
      85:           EXACT <''> (88)
      87:         TAIL (88)
      88:       WHILEM[2/2] (0)
      89:       NOTHING (90)
      90:       SUCCEED (0)
      91:     TAIL (92)
      92:     EXACT <'> (95)
      94: TAIL (95)
      95: END (0)
    anchored(BOL) minlen 1 
    Matching REx "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"... against "'p%x{e3}o de a%x{e7}%x{fa}car itaucard mastercard platinum S"...
    UTF-8 string...
       0 <> <'p%x{e3}o > |  1:BOL(2)
       0 <> <'p%x{e3}o > |  2:BRANCH(26)
       0 <> <'p%x{e3}o > |  3:  ANYOF[^\x09\x0a\x0c\x0d "#&-)/:@[-\^{-~][^{unicode}+utf8::IsSpacePerl](14)
                                failed...
       0 <> <'p%x{e3}o > | 26:BRANCH(94)
       0 <> <'p%x{e3}o > | 27:  TRIE-EXACT["'](95)
       0 <> <'p%x{e3}o > |      State:    1 Accepted: N Charid:  2 CP:  27 After State:    3
       1 <'> <p%x{e3}o d>|      State:    3 Accepted: Y Charid:  0 CP:   0 After State:    0
                                got 1 possible matches
                                TRIE matched word #2, continuing
                                only one match left, short-circuiting: #2 <'>
       1 <'> <p%x{e3}o d>| 63:  SUSPEND(92)
       1 <'> <p%x{e3}o d>| 65:    CURLYX[0] {1,32767}(89)
       1 <'> <p%x{e3}o d>| 88:      WHILEM[2/2](0)
                                    whilem: matched 0 out of 1..32767
       1 <'> <p%x{e3}o d>| 67:        BRANCH(84)
       1 <'> <p%x{e3}o d>| 68:          SUSPEND(88)
       1 <'> <p%x{e3}o d>| 70:            PLUS(82)
                                          ANYOF[\x00-&(-\xff][{unicode_all}] can match 60 times out of 2147483647...
      64 <NTABILIDAD])> <| 82:              SUCCEED(0)
                                            subpattern success...
      64 <NTABILIDAD])> <| 88:          WHILEM[2/2](0)
                                        whilem: matched 1 out of 1..32767
      64 <NTABILIDAD])> <| 67:            BRANCH(84)
      64 <NTABILIDAD])> <| 68:              SUSPEND(88)
      64 <NTABILIDAD])> <| 70:                PLUS(82)
                                              ANYOF[\x00-&(-\xff][{unicode_all}] can match 0 times out of 2147483647...
                                              failed...
                                            failed...
      64 <NTABILIDAD])> <| 84:            BRANCH(87)
      64 <NTABILIDAD])> <| 85:              EXACT <''>(88)
                                            failed...
                                          BRANCH failed...
                                        whilem: failed, trying continuation...
      64 <NTABILIDAD])> <| 89:            NOTHING(90)
      64 <NTABILIDAD])> <| 90:            SUCCEED(0)
                                          subpattern success...
      64 <NTABILIDAD])> <| 92:  EXACT <'>(95)
                    failed...
                  BRANCH failed...
    Match failed
    Match failed.
    Freeing REx: "%n    ^ (?: [^'%"\s~:/@\#|^&\[\]()\\{}]%n          [^%"\s~:/"...
    
Here is the possessive version in Java:

    $ cat crap.java
    import java.util.regex.*;
    
    public class crap {
    
    public static void
    main(String[ ] argv) {
        String input = "'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])";
        String regex = "\n"
                    + "(?: [^'\"\\s~:/@\\#|^&\\[\\]()\\\\{}]    # alt 1: unquoted         \n"
                    + "    [^\"\\s~:/@\\#|^&\\[\\]()\\\\{}] *                     \n"
                    + "  | \" (?: [^\"]++ | \"\" )++ \"       # alt 2: double quoted   \n"
                    + "  | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   \n"
                    + ")                                                          \n"
                    ;
        System.out.printf("Matching ‹%s› =~ qr{%s}x\n\n", input, regex);
    
        Pattern regcomp = Pattern.compile(regex, Pattern.COMMENTS);
        Matcher regexec = regcomp.matcher(input);
    
        int count;
        for (count = 0; regexec.find(); count++) {
           System.out.printf("Found match: ‹%s›\n", regexec.group());
        }
        if (count == 0) {
            System.out.printf("Match failed.\n");
        }
      }
    }
    
    $ javac -encoding UTF-8 crap.java && java -Dfile.encoding=UTF-8 crap
    Matching ‹'pão de açúcar itaucard mastercard platinum SUSTENTABILIDAD])› =~ qr{
    (?: [^'"\s~:/@\#|^&\[\]()\\{}]    # alt 1: unquoted         
        [^"\s~:/@\#|^&\[\]()\\{}] *                     
      | " (?: [^"]++ | "" )++ "       # alt 2: double quoted   
      | ' (?: [^']++ | '' )++ '       # alt 3: single quoted   
    )                                                          
    }x
    
    Found match: ‹pão›
    Found match: ‹de›
    Found match: ‹açúcar›
    Found match: ‹itaucard›
    Found match: ‹mastercard›
    Found match: ‹platinum›
    Found match: ‹SUSTENTABILIDAD›

Note that the Java pattern above has no leading `^` anchor, so `find()` walks through the input and reports each unquoted word instead of failing at the first character. The point is that it finishes promptly.