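Bug in double negation of regex character classes?

Update: In Java 11 the bug described below seems to be fixed (it may have been fixed even earlier, but I don't know in exactly which version; a linked report about a similar problem points to Java 9).

Tags: java, regex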

TL;DR (before the fix):
Why do
[^\\D2]
[^[^0-9]2]
[^2[^0-9]]
give different results in Java?


Code used for the tests. You can skip it for now.

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

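Let's say I need a regex that accepts characters that are

  • not a digit,
  • with the exception of 2.

So such a regex should represent all characters except 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of everything that is not a digit with 2:

  • [[^0-9]2]
  • [\\D2]

Both of these regexes work as expected: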

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true
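Now let's say I want to reverse the set of accepted characters (so I want to accept all digits except 2). I could write a regex that explicitly lists all accepted characters, like

  • [013-9]

or try to negate the two regexes described above by wrapping them in another [^...], like

  • [^\\D2]
  • [^[^0-9]2]

or even

  • [^2[^0-9]]

But to my surprise, only the first two versions ([013-9] and [^\\D2]) work as expected: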

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true 
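
The comparison in the table can also be automated. The following is a minimal sketch, not part of the original question: it counts the Latin-1 characters on which each candidate disagrees with the explicit class [013-9]. The class name CharClassProbe and the method disagreements are hypothetical names chosen for this illustration, and the counts for the nested-negation patterns depend on the Java version you run it on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CharClassProbe {
    // Returns every Latin-1 character on which the two character classes disagree.
    public static List<Character> disagreements(String regex, String reference) {
        Pattern p = Pattern.compile(regex);
        Pattern ref = Pattern.compile(reference);
        List<Character> result = new ArrayList<>();
        for (char c = 0; c <= 255; c++) {
            String s = String.valueOf(c);
            if (p.matcher(s).matches() != ref.matcher(s).matches()) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String reference = "[013-9]"; // explicit "all digits except 2"
        String[] candidates = { "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (String candidate : candidates) {
            // The number of disagreements depends on the Java version in use.
            System.out.printf("%-12s differs from %s on %d character(s)%n",
                    candidate, reference, disagreements(candidate, reference).size());
        }
    }
}
```

A count of 0 means the candidate accepts exactly the same Latin-1 characters as the reference class.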
So my question is: why don't [^[^0-9]2] and [^2[^0-9]] behave like [^\D2]? Can I somehow correct these regexes so that I can use [^0-9] inside them?

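According to the documentation, nesting classes produces the union of the two classes, which makes it impossible to create an intersection using that notation:

"To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8."

To create an intersection you have to use &&:

"To create a single character class matching only the characters common to all of its nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5."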

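The last part of your question is still a mystery to me too. The union of [^2] and [^0-9] should indeed be [^2], so [^2[^0-9]] behaves as expected. [^[^0-9]2] behaving like [^0-9] is indeed strange, though.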
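
The set argument can be checked without the regex engine at all. The sketch below models [^2] and [^0-9] as plain predicates over Latin-1 code points and confirms that their union accepts the same characters as [^2]. The names UnionOfComplements, notTwo, notDigit and sameOnLatin1 are hypothetical and exist only for this illustration.

```java
import java.util.function.IntPredicate;

public class UnionOfComplements {
    // True when both predicates accept exactly the same Latin-1 code points.
    public static boolean sameOnLatin1(IntPredicate a, IntPredicate b) {
        for (int c = 0; c <= 255; c++) {
            if (a.test(c) != b.test(c)) {
                return false;
            }
        }
        return true;
    }

    // Model of [^2]: everything except the character '2'.
    public static IntPredicate notTwo() {
        return c -> c != '2';
    }

    // Model of [^0-9]: everything except the ASCII digits.
    public static IntPredicate notDigit() {
        return c -> c < '0' || c > '9';
    }

    public static void main(String[] args) {
        IntPredicate union = notTwo().or(notDigit());
        System.out.println(sameOnLatin1(union, notTwo()));   // true
        System.out.println(sameOnLatin1(union, notDigit())); // false
    }
}
```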

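There is some strange voodoo in the character class parsing code of Oracle's implementation of the Pattern class, which ships with the JRE/JDK if you download it from Oracle's website or use OpenJDK. I have not checked how other JVM implementations parse the regexes in the question.

From this point on, any reference to the Pattern class and its internal workings is strictly restricted to Oracle's implementation (the reference implementation).

It would take some time to read and understand how the Pattern class parses the nested negations shown in the question. However, I have written a program [1] that extracts information from a compiled Pattern object in order to look at the result of compilation. The output below comes from running my program on Java HotSpot Client VM version 1.7.0_51.

[1] The program is currently a mess. I will update this post with a link once I have finished and refactored it.

[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

Nothing surprising here.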

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
The next two cases above compile to the same program as
[^0-9]
which is counter-intuitive.

[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match
Due to the same bug, the regex
[^2[^[^0-9]]]
compiles to the same program as
[^2[^0-9]]

There is an unresolved bug that seems to be of the same nature.


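Explanation

Preliminary

Below are implementation details of the Pattern class that you should know before reading further:

  • The Pattern class compiles a String into a chain of nodes. Each node has one small, well-defined responsibility and delegates the rest of the work to the next node in the chain. The Node class is the base class of all nodes.
  • The CharProperty class is the base class of all character-class-related Nodes.
  • The BitClass class is a subclass of CharProperty that uses a boolean[] array to speed up matching of Latin-1 characters (code point <= 255).

Comments

Thanks for your answer. That is also what I thought at first. It looks like in [^2[^0-9]] the [^2] is created first, and the regex engine later uses a union to combine it with [^0-9], so nothing changes, because the union of these two classes is [^2] ([^0-9] is a subset of [^2]). What bothers me is why [^[^0-9]2] behaves the same as [^0-9] rather than [^2]?

@Pshemo: Updated the answer a bit. I'm thinking that all the examples in the javadoc have the nested class as the last element. Could the behaviour be somewhat undefined if that convention isn't followed?

This is puzzling. It looks like an intersection is used instead of a union when [] is the first element of the outer negated character class. De Morgan might be the culprit here, but I'm not sure how to link him to this case.

What puzzles me even more is that [^[^0-9]2], [^[^[^0-9]]2] and [^[^[^[^0-9]]]2] all yield the same result. I tried looking at the code, but it is not easy to follow.

@Keppil Multiple nesting levels are even supported?

Just letting you know that [^[^0-9]] is different.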
    
    [^\D2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
          Ctype. Match POSIX character class DIGIT (US-ASCII)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^[^0-9]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^[^[^0-9]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^[^[^[^0-9]]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match
    
    [^2[^0-9]]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
          [U+0032]
          2
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    
    [^2[^[^0-9]]]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
          [U+0032]
          2
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match
    
    private CharProperty clazz(boolean consume) {
        // [Declaration and initialization of local variables - OMITTED]
        BitClass bits = new BitClass();
        int ch = next();
        for (;;) {
            switch (ch) {
                case '^':
                    // Negates if first char in a class, otherwise literal
                    if (firstInClass) {
                        // [CODE OMITTED]
                        ch = next();
                        continue;
                    } else {
                        // ^ not first in class, treat as literal
                        break;
                    }
                case '[':
                    // [CODE OMITTED]
                    ch = peek();
                    continue;
                case '&':
                    // [CODE OMITTED]
                    continue;
                case 0:
                    // [CODE OMITTED]
                    // Unclosed character class is checked here
                    break;
                case ']':
                    // [CODE OMITTED]
                    // The only return statement in this method
                    // is in this case
                    break;
                default:
                    // [CODE OMITTED]
                    break;
            }
            node = range(bits);
    
            // [CODE OMITTED]
            ch = peek();
        }
    }
    
    private CharProperty clazz(boolean consume) {
        CharProperty prev = null;
        CharProperty node = null;
        BitClass bits = new BitClass();
        boolean include = true;
        boolean firstInClass = true;
        int ch = next();
        for (;;) {
            switch (ch) {
                case '^':
                    // Negates if first char in a class, otherwise literal
                    if (firstInClass) {
                        if (temp[cursor-1] != '[')
                            break;
                        ch = next();
                        include = !include;
                        continue;
                    } else {
                        // ^ not first in class, treat as literal
                        break;
                    }
                case '[':
                    firstInClass = false;
                    node = clazz(true);
                    if (prev == null)
                        prev = node;
                    else
                        prev = union(prev, node);
                    ch = peek();
                    continue;
                case '&':
                    // [CODE OMITTED]
                    // There are interesting things (bugs) here,
                    // but it is not relevant to the discussion.
                    continue;
                case 0:
                    firstInClass = false;
                    if (cursor >= patternLength)
                        throw error("Unclosed character class");
                    break;
                case ']':
                    firstInClass = false;
    
                    if (prev != null) {
                        if (consume)
                            next();
    
                        return prev;
                    }
                    break;
                default:
                    firstInClass = false;
                    break;
            }
            node = range(bits);
    
            if (include) {
                if (prev == null) {
                    prev = node;
                } else {
                    if (prev != node)
                        prev = union(prev, node);
                }
            } else {
                if (prev == null) {
                    prev = node.complement();
                } else {
                    if (prev != node)
                        prev = setDifference(prev, node);
                }
            }
            ch = peek();
        }
    }
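
The control flow in the listing above can be reduced to a small model to see where the two orderings diverge. This is a simplified sketch under stated assumptions, not the JDK code: a nested class follows the `case '['` branch, which takes a union without consulting `include`, while any other item goes through the `include` check after `range(bits)`. The names LegacyClazzModel, Item, nested, plain and build are hypothetical, `&&` handling is left out, and treating an escape such as \D as a plain item is an assumption based on the listing and on the dump for [^\D2].

```java
import java.util.function.IntPredicate;

public class LegacyClazzModel {
    // One element of a character class: a nested [...] class or a plain item.
    public static final class Item {
        final IntPredicate chars;
        final boolean nested;

        Item(IntPredicate chars, boolean nested) {
            this.chars = chars;
            this.nested = nested;
        }
    }

    public static Item nested(IntPredicate chars) { return new Item(chars, true); }

    public static Item plain(IntPredicate chars) { return new Item(chars, false); }

    // Models the loop in clazz(): 'negated' plays the role of !include.
    public static IntPredicate build(boolean negated, Item... items) {
        IntPredicate prev = null;
        for (Item item : items) {
            if (item.nested || !negated) {
                // case '[' (negation flag not consulted) and the include branch: union
                prev = (prev == null) ? item.chars : prev.or(item.chars);
            } else if (prev == null) {
                prev = item.chars.negate();           // prev = node.complement()
            } else {
                prev = prev.and(item.chars.negate()); // prev = setDifference(prev, node)
            }
        }
        return prev;
    }

    public static void main(String[] args) {
        IntPredicate notDigit = c -> c < '0' || c > '9';
        IntPredicate two = c -> c == '2';
        IntPredicate a = build(true, nested(notDigit), plain(two)); // [^[^0-9]2]
        IntPredicate b = build(true, plain(two), nested(notDigit)); // [^2[^0-9]]
        IntPredicate c = build(true, plain(notDigit), plain(two));  // [^\D2]
        for (char ch : new char[] { 'x', '1', '2' }) {
            System.out.printf("%c: %b %b %b%n", ch, a.test(ch), b.test(ch), c.test(ch));
        }
    }
}
```

For the characters x, 1 and 2, the three modeled classes give the same true/false values as the corresponding columns of the table in the question.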