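Bug in double negation of regex character classes?
Update: In Java 11 the bug described below appears to be fixed (it may have been fixed earlier, but I don't know in exactly which version; see the linked similar question about Java 9).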
TL;DR (before the fix): Why do [^\\D2], [^[^0-9]2] and [^2[^0-9]] give different results in Java?
Code used for the tests. You can skip it for now.
public class RegexNegationTest {
    public static void main(String[] args) {
        // Two ways to write "non-digit or 2", then four attempts at "digit except 2"
        String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
        String[] tests = { "x", "1", "2", "3", "^", "[", "]" };
        // Header row: one column per regex
        System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
        System.out.println("-----------------------------------------------------------------------");
        // One row per test string; each cell is the result of String.matches
        for (String test : tests)
            System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
                    test.matches(regexes[0]), test.matches(regexes[1]),
                    test.matches(regexes[2]), test.matches(regexes[3]),
                    test.matches(regexes[4]), test.matches(regexes[5]));
    }
}
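Suppose I need a regex that accepts characters that are
- not digits,
- with the exception of 2.
So the regex should accept every character except 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of everything that is not a digit with 2:
[[^0-9]2]
[\\D2]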
match , [[^0-9]2] , [\D2]
--------------------------
x , true , true
1 , false , false
2 , true , true
3 , false , false
^ , true , true
[ , true , true
] , true , true
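Now suppose I want to invert the accepted characters (so I want to accept all digits except 2). I could write a regex that explicitly lists all accepted characters, such as
[013-9]
or try to negate the two regexes described earlier by wrapping them in another [^…], like
[^\\D2]
[^[^0-9]2]
or even
[^2[^0-9]]
In the results, only [013-9] and [^\\D2] behave as expected.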
match | [[^0-9]2] , [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]]
------+--------------------+-------------------------------------------
x | true , true | false , false , true , true
1 | false , false | true , true , false , true
2 | true , true | false , false , false , false
3 | false , false | true , true , false , true
^ | true , true | false , false , true , true
[ | true , true | false , false , true , true
] | true , true | false , false , true , true
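So my question is: why don't [^[^0-9]2] and [^2[^0-9]] behave like [^\D2]? Can I somehow correct these regexes so that I can use [^0-9] inside them?

According to the documentation, nesting character classes produces the union of the two classes, which makes it impossible to create an intersection with that notation: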
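"To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8."
To create an intersection you have to use &&:
"To create a single character class matching only the characters common to all of its nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5."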
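The union and intersection forms quoted above can be tried directly. The following is a minimal sketch, not part of the original answer: the class name CharClassOps and the helper acceptedDigits are made up for illustration. The third pattern, [0-9&&[^2]], is the subtraction form listed in the Pattern documentation, and it may be a safer way to say "all digits except 2" than wrapping a nested negated class in another negation.

```java
import java.util.regex.Pattern;

class CharClassOps {
    /** Collects the ASCII digits accepted by the given single-character regex. */
    static String acceptedDigits(String regex) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(acceptedDigits("[0-4[6-8]]"));   // union: prints 01234678
        System.out.println(acceptedDigits("[0-9&&[345]]")); // intersection: prints 345
        System.out.println(acceptedDigits("[0-9&&[^2]]"));  // subtraction: prints 013456789
    }
}
```

Only digits are probed here, so the sketch says nothing about how each class treats other characters.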
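The last part of your question is still a mystery to me as well. The union of [^2] and [^0-9] should indeed be [^2], so [^2[^0-9]] behaves as expected. [^[^0-9]2] behaving like [^0-9] is indeed strange, though.

There is some strange voodoo going on in the character class parsing code of Oracle's implementation of the Pattern class, which ships with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM implementations parse the regexes in the question.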
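From this point on, any reference to the Pattern class and its inner workings is strictly restricted to Oracle's implementation (the reference implementation).
It would take some time to read and understand how the Pattern class parses the nested negations shown in the question. However, I have written a program [1] that extracts information from a Pattern object in order to look at the result of compilation. The output below comes from running my program on Java HotSpot Client VM version 1.7.0_51.
[1]: The program is currently a mess. I will update this post with a link once I have finished and refactored it.
Nothing surprising here: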
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
The next two cases above are compiled to the same program as [^0-9], which is counter-intuitive.
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
Due to the same bug, the regex [^2[^[^0-9]]] compiles to the same program as [^2[^0-9]].
There is an unresolved bug that appears to be of the same nature.
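Claims such as "these two regexes compile to the same program" can also be checked from the outside, without looking at Pattern internals. The sketch below is an assumption-laden illustration rather than the program mentioned in the answer: ClassEquivalence and firstDifference are hypothetical names, and the check only compares behavior on Latin-1 code points (0 to 255), so it cannot prove the compiled programs are identical.

```java
import java.util.regex.Pattern;

class ClassEquivalence {
    /** Returns the first Latin-1 code point on which the two classes disagree, or -1 if none. */
    static int firstDifference(String regexA, String regexB) {
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        for (int cp = 0; cp <= 255; cp++) {
            String s = String.valueOf((char) cp);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pairs taken from the question and the answer; results depend on the Java version.
        String[][] pairs = {
            { "[013-9]", "[^\\D2]" },
            { "[013-9]", "[^[^0-9]2]" },
            { "[^2[^0-9]]", "[^2[^[^0-9]]]" },
        };
        for (String[] pair : pairs) {
            int cp = firstDifference(pair[0], pair[1]);
            System.out.println(pair[0] + " vs " + pair[1] + ": "
                    + (cp < 0 ? "same on Latin-1" : "differ at U+" + String.format("%04X", cp)));
        }
    }
}
```

Running it on different Java versions is one way to see whether the behavior described here still applies to a given runtime.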
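Explanation
Preliminary
Below are implementation details of the Pattern class that you should know before reading further:
- The Pattern class compiles a String into a chain of nodes. Each node has a small, well-defined responsibility and delegates the remaining work to the next node in the chain. The Node class is the base class of all nodes.
- The CharProperty class is the base class of all character-class-related nodes.
- The BitClass class is a subclass of CharProperty that uses a boolean[] array to speed up matching of Latin-1 characters (code point <= 255).

Thanks for your answer. That is also what I thought at first. It looks like in [^2[^0-9]] the class [^2] is created first and the regex engine later combines it with [^0-9] using a union, so nothing changes, because the sum of these two classes is [^2] ([^0-9] is a subset of [^2]). What bothers me is why [^[^0-9]2] behaves the same as [^0-9] rather than [^2].
@Pshemo: Updated the answer a bit. I was thinking that all the examples in the javadoc have the nested class as the last element. Could the behavior be somewhat undefined if that convention is not followed?
It is puzzling. It seems that if a [] is the first element of an outer negated character class, intersection is used instead of union. DeMorgan might be the culprit here, but I'm not sure how to tie him to this case.
What puzzles me even more is that [^[^0-9]2], [^[^[^0-9]]2] and [^[^[^[^0-9]]]2] all produce the same result. I tried looking at the code, but it is not easy to follow.
@Keppil Are multiple levels of nesting even supported?
Just to let you know, [^[^0-9]] is different from ...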
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0030][U+0031]
01
Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
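The tree in the dump above can be evaluated by hand to see why it disagrees with the intended "digits except 2". The sketch below models the node kinds as plain predicates. The class NodeDumpModel and the helpers range, bits and setDifference are hypothetical stand-ins named after the dump, not the JDK's own classes.

```java
import java.util.function.IntPredicate;

class NodeDumpModel {
    // Hypothetical stand-ins for the node kinds named in the dump.
    static IntPredicate range(int lo, int hi) {
        return cp -> lo <= cp && cp <= hi;
    }

    static IntPredicate bits(String chars) {
        return cp -> chars.indexOf(cp) >= 0;
    }

    static IntPredicate setDifference(IntPredicate first, IntPredicate second) {
        return cp -> first.test(cp) && !second.test(cp);
    }

    /** Tree from the dump for [^[^0-9]2]: setDifference(complement(range 0-9), bits{2}). */
    static IntPredicate compiledDoubleNegation() {
        return setDifference(range('0', '9').negate(), bits("2"));
    }

    public static void main(String[] args) {
        IntPredicate compiled = compiledDoubleNegation();
        for (char c : "x123".toCharArray()) {
            System.out.println(c + " -> " + compiled.test(c));
        }
        // prints: x -> true, 1 -> false, 2 -> false, 3 -> false (one per line)
    }
}
```

Since 2 is already outside "everything that is not a digit", subtracting it changes nothing, which agrees with the [^[^0-9]2] column of the results table: the class behaves like [^0-9].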
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);
        // [CODE OMITTED]
        ch = peek();
    }
}
private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;
                if (prev != null) {
                    if (consume)
                        next();
                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);
        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}
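The listing above suggests where the asymmetry comes from: '^' only flips the include flag, a nested class in the '[' case is merged with union without consulting that flag, and the flag is only looked at when a literal or range is folded in. The following is a stripped-down re-enactment of that folding logic, offered as a sketch under simplifying assumptions rather than as the JDK code. MiniClazz and its methods are invented names, and only '^', nested '[', ']', literals and a-b ranges are handled (no '&&', no escapes, no BitClass sharing).

```java
import java.util.function.IntPredicate;

class MiniClazz {
    private final String src;
    private int cursor;

    private MiniClazz(String pattern) {
        this.src = pattern;
    }

    /** Parses a class such as "[^[^0-9]2]" and returns the predicate it folds into. */
    static IntPredicate parse(String pattern) {
        MiniClazz parser = new MiniClazz(pattern);
        parser.cursor = 1; // skip the opening '['
        return parser.clazz();
    }

    private IntPredicate clazz() {
        IntPredicate prev = null;
        boolean include = true;
        boolean firstInClass = true;
        while (cursor < src.length()) {
            char ch = src.charAt(cursor);
            if (ch == '^' && firstInClass) {
                include = false; // negation only flips a flag
                firstInClass = false;
                cursor++;
                continue;
            }
            firstInClass = false;
            if (ch == '[') {
                cursor++;
                IntPredicate node = clazz(); // nested class, parsed recursively
                // As in the listing: merged by union, 'include' is not consulted here.
                prev = (prev == null) ? node : prev.or(node);
                continue;
            }
            if (ch == ']') {
                if (prev == null) {
                    throw new IllegalArgumentException("Empty class is not handled in this sketch");
                }
                cursor++;
                return prev;
            }
            IntPredicate node = range();
            if (include) {
                prev = (prev == null) ? node : prev.or(node);
            } else {
                // complement for the first element, set difference afterwards
                prev = (prev == null) ? node.negate() : prev.and(node.negate());
            }
        }
        throw new IllegalArgumentException("Unclosed character class");
    }

    /** Reads a single literal or an a-b range starting at the cursor. */
    private IntPredicate range() {
        int lo = src.charAt(cursor++);
        if (cursor + 1 < src.length() && src.charAt(cursor) == '-' && src.charAt(cursor + 1) != ']') {
            int hi = src.charAt(cursor + 1);
            cursor += 2;
            return cp -> lo <= cp && cp <= hi;
        }
        return cp -> cp == lo;
    }

    public static void main(String[] args) {
        for (String regex : new String[] { "[013-9]", "[^[^0-9]2]", "[^2[^0-9]]" }) {
            IntPredicate p = parse(regex);
            System.out.println(regex + ": x=" + p.test('x') + " 1=" + p.test('1') + " 2=" + p.test('2'));
        }
    }
}
```

Under these assumptions, [^[^0-9]2] folds into "non-digits minus 2" and [^2[^0-9]] into "not 2, or not a digit", which lines up with the node dumps and the results table shown earlier.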