Pattern.matcher in Java 8 not matching the section sign §

I have the following code snippet:

private static final Pattern ESCAPER_PATTERN = Pattern.compile("[^a-zA-Z0-9\\p{P}\\s]*");

/**
 * @param args
 */
public static void main(String[] args)
{
    String unaccentedText = "Aa123 \\/*-+.=+:/;.,?u%µ£$*^¨-)ac!e§('\"e&€#²³~´][{^";
    System.out.println(ESCAPER_PATTERN.matcher(unaccentedText).replaceAll(""));         
}

When I run this with JDK 7, the output I get is:

Aa123 \/*-.:/;.,?u%*-)ac!e('"e&#][{

When I run the same code with JDK 8, the output I get is:

Aa123 \/*-.:/;.,?u%*-)ac!e§('"e&#][{

Note that JDK 8 does not remove the section sign §.

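Please let me know which regular expression to use in JDK 8 to match the section sign, and the reason for the difference in behavior between the JDKs.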

Unicode moved your cheese

The character U+00A7 SECTION SIGN has been changed from category So (Symbol, Other) to category Po (Punctuation, Other):

  • UnicodeData.txt

    • U+00A7, U+00B6, U+0F14, U+1360, and U+10102 changed from gc=So to gc=Po
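
Since Java uses Unicode 6.0.0 in version 7 and was updated to Unicode 6.2.0 in version 8, this explains the difference in results. Since § is now in the Punctuation category, it is matched by \p{P} in Java 8.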
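
To see which side of this change a given runtime is on, the category can be inspected directly. The following is a minimal sketch and not part of the original answer; the class name CategoryCheck and its helper methods are made up for illustration. It relies only on Character.getType and Pattern, and what it prints depends on the Unicode version shipped with the JDK it runs on.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CategoryCheck {
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}");

    // True if this runtime classifies the code point as Po (Punctuation, Other).
    static boolean isOtherPunctuation(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_PUNCTUATION;
    }

    // True if this runtime classifies the code point as So (Symbol, Other).
    static boolean isOtherSymbol(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }

    // True if \p{P} matches the single code point on this runtime.
    static boolean matchesPunctuationClass(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return PUNCTUATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The code points listed in the UnicodeData.txt change note above.
        int[] moved = {0x00A7, 0x00B6, 0x0F14, 0x1360, 0x10102};
        for (int cp : moved) {
            System.out.printf("U+%04X Po=%b So=%b \\p{P}=%b%n",
                    cp, isOtherPunctuation(cp), isOtherSymbol(cp),
                    matchesPunctuationClass(cp));
        }
    }
}
```

On a runtime that still treats U+00A7 as So, the \p{P} column should read false for it; on one that treats it as Po, it should read true.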

The wrong solution

Since ordinary punctuation such as #, … also belongs to the Po category, we can't really remove this subcategory.
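
As a rough illustration of why dropping Po is not an option, the sketch below lists every punctuation subcategory except Po in the allowed set. The class name PoSubcategoryDemo, the method stripWithoutPo and the sample input are assumptions made for this example only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class PoSubcategoryDemo {
    // Allows letters, digits, whitespace and every punctuation subcategory
    // except Po (Pd, Ps, Pe, Pc, Pi, Pf are listed; Po is deliberately left out).
    private static final Pattern WITHOUT_PO = Pattern.compile(
            "[^a-zA-Z0-9\\p{Pd}\\p{Ps}\\p{Pe}\\p{Pc}\\p{Pi}\\p{Pf}\\s]");

    static String stripWithoutPo(String text) {
        return WITHOUT_PO.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // '#' and '!' are Po, '-' is Pd: the first two are stripped along
        // with anything else in Po, which is not what the question wants.
        System.out.println(stripWithoutPo("a#b!c-d")); // prints "abc-d"
    }
}
```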

The next obvious solution is to remove the unwanted character using character class intersection:

"[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]"

…But wait, the regex above compiles to:

[^a-zA-Z0-9\p{P}\s&&[^§]]
Start. Start unanchored match (minLength=1)
Pattern.intersection. S ∩ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      §
java.util.regex.Pattern$LastNode
Node. Accept match

The tree shows the negation being applied to the first operand before the intersection, so the class ends up excluding § instead of matching it. The class can instead be written as a union of the negated class with §, which compiles to:

[[^a-zA-Z0-9\p{P}\s]§]
Start. Start unanchored match (minLength=1)
Pattern.union. S ∪ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  BitClass. Match any of these 1 character(s):
    §
java.util.regex.Pattern$LastNode
Node. Accept match
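
A minimal sketch of the union form applied to the input from the question follows. The class name SectionSignDemo and the method strip are made up for this example, the non-ASCII characters are written as \u escapes only to keep the source file encoding-independent, and the trailing * quantifier from the original pattern is dropped because replaceAll already visits every match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class SectionSignDemo {
    // Union of the negated "allowed" class with U+00A7 (section sign),
    // so the section sign is matched whatever category the runtime gives it.
    private static final Pattern ESCAPER_PATTERN =
            Pattern.compile("[[^a-zA-Z0-9\\p{P}\\s]\u00a7]");

    static String strip(String text) {
        return ESCAPER_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Same input as in the question, with non-ASCII characters escaped.
        String unaccentedText = "Aa123 \\/*-+.=+:/;.,?u%\u00b5\u00a3$*^\u00a8-)ac!e\u00a7('\"e&\u20ac#\u00b2\u00b3~\u00b4][{^";
        System.out.println(strip(unaccentedText));
    }
}
```

With this pattern the output should match the JDK 7 result shown in the question, with § removed.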

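I don't see \p{P} in the documentation; what is it supposed to match?

@SeanBright: Take a look at the documentation and look for the "General category" constants.

@nhahtdh - I'm being a bit thick; I don't see "P" anywhere. I'll take your word for it, and I'm sure I'll understand once I read it carefully :-)

@SeanBright: Ah, this doesn't seem to be clearly mentioned. P includes all the shorthand categories starting with P (punctuation).

@nhahtdh - Got it. Thanks for the education.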