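Bug in double negation of regex character classes? (Java, regex)

Update: in Java 11 the bug described below appears to be fixed. It may have been fixed even earlier, but I don't know in exactly which version; the linked report of a similar problem points to Java 9.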

TL;DR (before the fix): why do `[^\\D2]`, `[^[^0-9]2]` and `[^2[^0-9]]` give different results in Java?

Code used for the tests. You can skip it for now.

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

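Say I need a regex that accepts characters that are not a digit, with the exception of 2, which should also be accepted.

Such a regex should therefore match every character except the digits 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of everything that is not a digit with 2: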

```
[[^0-9]2]
```
```
[\\D2]
```

Both of these regexes work as expected:

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true

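Now suppose I want to invert the accepted characters, so that I accept all digits except 2. I can create a regex that explicitly lists all accepted characters, like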

```
[013-9]
```

or try to negate the two regexes described above by wrapping them in another `[^…]`, like

```
[^\\D2]
```
```
[^[^0-9]2]
```
or even
```
[^2[^0-9]]
```

But to my surprise, only the first two versions (`[013-9]` and `[^\D2]`) work as expected:

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true
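
To spot the disagreeing columns without reading the whole table, one could compare each candidate against the explicit `[013-9]` and list only the inputs where they differ. This is a minimal sketch; the class and method names (`RegexDiff`, `mismatches`) are made up for illustration, and what it prints depends on the JDK it runs on, since the behaviour reportedly changed in later releases.

```java
import java.util.ArrayList;
import java.util.List;

class RegexDiff {
    // Returns the inputs on which `candidate` and `reference` disagree
    // when each is used with String.matches.
    static List<String> mismatches(String reference, String candidate, String... inputs) {
        List<String> diff = new ArrayList<>();
        for (String in : inputs) {
            if (in.matches(reference) != in.matches(candidate)) {
                diff.add(in);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
        String reference = "[013-9]";
        String[] candidates = { "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
        for (String candidate : candidates) {
            // An empty list means the candidate agrees with the reference.
            System.out.println(candidate + " differs on " + mismatches(reference, candidate, tests));
        }
    }
}
```

An empty list for a candidate means it behaves like `[013-9]` on the sample inputs; it says nothing about characters that are not in the sample.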

So my question is: why don't `[^[^0-9]2]` and `[^2[^0-9]]` behave like `[^\D2]`? Can I somehow correct these regexes so that I can use `[^0-9]` inside them?

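According to the documentation, nesting classes produces the union of the two classes, which makes it impossible to create an intersection using that notation:

"To create a union, simply nest one class inside the other, such as `[0-4[6-8]]`. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8."

To create an intersection you have to use `&&`:

"To create a single character class matching only the characters common to all of its nested classes, use `&&`, as in `[0-9&&[345]]`. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5."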
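
The two quoted examples can be checked directly. The sketch below also tries `[0-9&&[^2]]`, which follows the subtraction idiom from the `Pattern` Javadoc (`[a-z&&[^bc]]`) and is one way to write "all digits except 2" while still using a nested negated class. The helper name `acceptedDigits` is made up for illustration, and I have not verified the third pattern on the affected JDK builds.

```java
import java.util.regex.Pattern;

class CharClassOps {
    // Returns the digits 0-9 accepted by the given single-character regex.
    static String acceptedDigits(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(acceptedDigits("[0-4[6-8]]"));   // union: 01234678
        System.out.println(acceptedDigits("[0-9&&[345]]")); // intersection: 345
        System.out.println(acceptedDigits("[0-9&&[^2]]"));  // subtraction: 013456789
    }
}
```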

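The last part of your question is a mystery to me too. The union of `[^2]` and `[^0-9]` should indeed be `[^2]`, so `[^2[^0-9]]` behaves as expected. `[^[^0-9]2]` behaving like `[^0-9]` is indeed strange, though.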
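
The set argument can be checked without involving the regex engine at all. In this sketch the two classes are modelled as plain predicates over code points (an assumption made for illustration; it is not how `Pattern` represents them), and the union is compared with each operand over the Latin-1 range.

```java
import java.util.function.IntPredicate;

class UnionCheck {
    static final IntPredicate NOT_TWO = c -> c != '2';               // models [^2]
    static final IntPredicate NOT_DIGIT = c -> c < '0' || c > '9';   // models [^0-9]
    static final IntPredicate UNION = NOT_TWO.or(NOT_DIGIT);         // union of the two

    // True when both predicates agree on every code point in [0, limit).
    static boolean sameOn(IntPredicate a, IntPredicate b, int limit) {
        for (int c = 0; c < limit; c++) {
            if (a.test(c) != b.test(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameOn(UNION, NOT_TWO, 256));   // true: the union is just [^2]
        System.out.println(sameOn(UNION, NOT_DIGIT, 256)); // false: e.g. '1' is in the union
    }
}
```

Because `[^0-9]` is a subset of `[^2]`, adding it to the union changes nothing, which is why `[^2[^0-9]]` matching every digit except 2 is consistent with union semantics.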

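There is some strange voodoo going on in the character class parsing code of Oracle's implementation of the `Pattern` class, which ships with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM implementations parse the regexes in the question.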

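From this point on, any reference to the `Pattern` class and its internal workings is strictly limited to Oracle's implementation (the reference implementation).

It would take some time to read and understand how the `Pattern` class parses the nested negations shown in the question. However, I have written a program [1] that extracts information from a `Pattern` object so that I can look at the result of compilation. The output below comes from running my program on Java HotSpot Client VM version 1.7.0_51.

[1]: The program is currently a mess. I will update this post with a link once I have finished and refactored it.

Nothing surprising here: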
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

The cases above compile to the same program as `[^0-9]`, which is counter-intuitive for the nested negation `[^[^0-9]]`.
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

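Due to the same bug, the regex `[^2[^[^0-9]]]` compiles to the same program as `[^2[^0-9]]`.

There is an unresolved bug report that appears to be of the same nature.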

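Explanation

Preliminary

Below are implementation details of the `Pattern` class that you should know before reading further:

The `Pattern` class compiles a `String` into a chain of nodes. Each node has a small, well-defined responsibility and delegates the rest of the work to the next node in the chain. The `Node` class is the base class of all nodes.

The `CharProperty` class is the base class of all character-class-related nodes.

The `BitClass` class is a subclass of `CharProperty` that uses a `boolean[]` array to speed up matching for Latin-1 characters (code point <= 255).

Thanks for your answer. That is also what I thought at first. It looks like in `[^2[^0-9]]` the `[^2]` is created first, and later the regex engine combines it with `[^0-9]` using union, so nothing changes, because the sum of these two classes is `[^2]` (`[^0-9]` is a subset of `[^2]`). What bothers me is why `[^[^0-9]2]` behaves the same as `[^0-9]` rather than `[^2]`.

@Pshemo: Updated the answer a bit. I was thinking that all the examples in the javadoc have the nested class as the last element. Could the behaviour be somewhat undefined if that convention is not followed? This is puzzling.

It looks as if intersection is used instead of union when a `[]` is the first element of an outer negated character class. De Morgan may be the culprit here, but I am not sure how to connect him to this case.

What confuses me even more is that `[^[^0-9]2]`, `[^[^[^0-9]]2]` and `[^[^[^[^0-9]]]2]` all produce the same result. I tried looking at the code, but it is not easy to follow.

@Keppil Are multiple nesting levels even supported?

Just to let you know, `[^[^0-9]]` is different from ...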
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
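
To see that the dumped node trees explain the table in the question, the two problematic programs can be rebuilt by hand from predicates. This is only a model of the dumps above: `IntPredicate` stands in for `CharProperty`, and the constant names are made up for illustration.

```java
import java.util.function.IntPredicate;

class CompiledPrograms {
    static final IntPredicate RANGE_0_9 = c -> c >= '0' && c <= '9'; // Pattern.rangeFor U+0030..U+0039
    static final IntPredicate BIT_2 = c -> c == '2';                 // BitClass holding U+0032

    // [^[^0-9]2] as dumped: setDifference(complement(range 0-9), {2})
    static final IntPredicate NEG_NESTED_FIRST = RANGE_0_9.negate().and(BIT_2.negate());

    // [^2[^0-9]] as dumped: union(complement({2}), complement(range 0-9))
    static final IntPredicate NEG_NESTED_LAST = BIT_2.negate().or(RANGE_0_9.negate());

    public static void main(String[] args) {
        // Prints one row per input: input, [^[^0-9]2], [^2[^0-9]]
        for (char c : new char[] { 'x', '1', '2', '3' }) {
            System.out.println(c + " " + NEG_NESTED_FIRST.test(c) + " " + NEG_NESTED_LAST.test(c));
        }
    }
}
```

The rows come out as `x true true`, `1 false true`, `2 false false`, `3 false true`, which agrees with the last two columns of the table in the question. The outer `^` was applied only to the literal `2`, never to the nested class.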

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}
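
The control flow of the listing can be replayed with a much smaller model. The sketch below is my own simplification, not the JDK code: it handles only literals, `a-b` ranges, nested brackets and a leading `^` (no escapes, no `&&`), uses `IntPredicate` in place of `CharProperty`, and the class name `ClazzSketch` is hypothetical. It keeps the two properties that matter here: a nested class is always merged with `union` regardless of `include`, and `include` only affects nodes returned by `range`.

```java
import java.util.function.IntPredicate;

class ClazzSketch {
    private final String pattern;
    private int cursor = 1; // index just past the opening '['

    private ClazzSketch(String pattern) {
        this.pattern = pattern;
    }

    static IntPredicate compile(String regex) {
        if (regex.isEmpty() || regex.charAt(0) != '[') {
            throw new IllegalArgumentException("expected a character class");
        }
        return new ClazzSketch(regex).clazz();
    }

    private IntPredicate clazz() {
        IntPredicate prev = null;
        boolean include = true;
        boolean firstInClass = true;
        for (;;) {
            if (cursor >= pattern.length()) {
                throw new IllegalArgumentException("Unclosed character class");
            }
            char ch = pattern.charAt(cursor);
            if (ch == '^' && firstInClass && pattern.charAt(cursor - 1) == '[') {
                cursor++;
                include = !include; // remembered, but applied only to range() nodes
                continue;
            }
            firstInClass = false;
            if (ch == '[') {
                cursor++;
                IntPredicate node = clazz();
                // As in the listing: `include` is not consulted for a nested class.
                prev = (prev == null) ? node : prev.or(node);
                continue;
            }
            if (ch == ']' && prev != null) {
                cursor++;
                return prev;
            }
            IntPredicate node = range();
            if (include) {
                prev = (prev == null) ? node : prev.or(node);                    // union
            } else {
                prev = (prev == null) ? node.negate() : prev.and(node.negate()); // complement / setDifference
            }
        }
    }

    // Parses a single literal or an a-b range starting at the cursor.
    private IntPredicate range() {
        final char lo = pattern.charAt(cursor++);
        if (cursor + 1 < pattern.length() && pattern.charAt(cursor) == '-'
                && pattern.charAt(cursor + 1) != ']') {
            final char hi = pattern.charAt(cursor + 1);
            cursor += 2;
            return c -> c >= lo && c <= hi;
        }
        return c -> c == lo;
    }

    public static void main(String[] args) {
        for (String regex : new String[] { "[013-9]", "[^[^0-9]2]", "[^2[^0-9]]" }) {
            IntPredicate p = compile(regex);
            System.out.println(regex + " 1:" + p.test('1') + " 2:" + p.test('2') + " x:" + p.test('x'));
        }
    }
}
```

In this model `[^[^0-9]2]` ends up as "not a digit, minus 2" and `[^2[^0-9]]` as "not 2, or not a digit", the same shapes as in the dumps above.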