Java: efficiently replace unless matching a complex regular expression

java regex performance replace

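I have over a gigabyte of text that I need to go through, surrounding punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though most of that is lists) that defines when punctuation should not be separated. Because it is long and complicated, it is hard to use groups with it, though I wouldn't rule that out, since I could make most of the groups non-capturing (?:).

Question: how can I efficiently replace certain characters that do not match a particular regular expression?

I've looked into using lookaheads or something similar. I haven't quite figured it out, but it seems to be terribly inefficient anyway. It would probably still be better than using placeholders, though. I can't seem to find a good "replace with a bunch of different regular expressions, finding and replacing in one pass" function.

Should I do this line by line instead of operating on the whole text?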

String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);

//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");

// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");

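Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned about apostrophes or single quotes in my protecting regex anyway, so that isn't a real problem.

I'm rewriting what I considered very inefficient Perl code in Java, keeping track of speed, and things were going fine until I started replacing the placeholders with the original strings. With that addition it is too slow to be reasonable (I've never seen it get even close to finishing).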

//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();
int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i […] 0) {
            resultStr += replaceStr.substring(0, currentPos);
            currentPos = 0;
            currentStr = "";
        }
        resultStr += ParseUtils.getSymbol(pt);
    }
}
s = resultStr;

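This code is probably not the most efficient way to put the protected matches back. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders?

At first I thought appendReplacement wasn't what I was looking for, but it turns out it was. Since it is replacing the placeholders at the end that slows things down, all I really needed was a way to replace matches dynamically: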

StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
    replaceM.appendReplacement(replacedBuff, "");
    replacedBuff.append(protectedStrs.get(index));
    index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();
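For reference, here is a minimal, self-contained sketch of the whole placeholder round trip using appendReplacement(). The `PROTECTED` pattern is a hypothetical stand-in for the real 1818-character regex, the class and method names are made up for illustration, and the two apostrophe replacements are left out for brevity. It also assumes the input never contains the literal text `<PROTECTED>`, otherwise the index into the saved matches would drift.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderRoundTrip {
    // Hypothetical stand-in for the long protecting regex: keeps decimals such as 3.14 or $3.50 intact.
    static final Pattern PROTECTED = Pattern.compile("\\$?\\p{N}+\\.\\p{N}+");
    static final String PLACEHOLDER = "<PROTECTED>";
    // Quote the placeholder so it is matched literally, and compile it once.
    static final Pattern PLACEHOLDER_PATTERN = Pattern.compile(Pattern.quote(PLACEHOLDER));
    static final Pattern PUNCTUATION = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");

    static String tokenize(String s) {
        // 1. Remember every protected match, then swap in placeholders.
        List<String> protectedStrs = new ArrayList<>();
        Matcher protectedM = PROTECTED.matcher(s);
        while (protectedM.find()) {
            protectedStrs.add(protectedM.group());
        }
        s = protectedM.replaceAll(PLACEHOLDER); // replaceAll resets the matcher first

        // 2. Separate punctuation; '<' and '>' are excluded, so placeholders survive.
        s = PUNCTUATION.matcher(s).replaceAll(" $1 ");

        // 3. Restore the originals in a single pass over the string.
        StringBuffer restored = new StringBuffer(s.length());
        Matcher placeholderM = PLACEHOLDER_PATTERN.matcher(s);
        int index = 0;
        while (placeholderM.find()) {
            // Append an empty replacement, then the saved text directly, so that
            // '$' or '\' in the saved text is not treated as a group reference.
            placeholderM.appendReplacement(restored, "");
            restored.append(protectedStrs.get(index++));
        }
        placeholderM.appendTail(restored);
        return restored.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("cost:$3.50!")); // prints "cost : $3.50 ! " (with a trailing space)
    }
}
```

Appending the saved string with a plain append() rather than passing it as the replacement argument is what makes this safe for protected text that contains `$`.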

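Another option to consider: during the first pass through the string to find the protected strings, take the start and end indices of each match, replace the punctuation in everything outside the match, append the matched string, and then keep going. This removes the need to write a string with placeholders and requires only one pass through the entire string. It does, however, require many separate small replacement operations. (By the way, make sure you compile the patterns before the loop rather than using String.replaceAll().) A similar alternative is to join the unprotected substrings together and replace them all at once. However, the protected strings would then have to be added back into the replaced string at the end, so I doubt this would save any time.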

int currIndex = 0;
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
    String substr = s.substring(currIndex,protectedM.start());
    substr = p1.matcher(substr).replaceAll(" $1 ");
    substr = p2.matcher(substr).replaceAll("$1 '$2");
    substr = p3.matcher(substr).replaceAll("$1 ' $2");
    resultStr += substr+protectedM.group();
    currIndex = protectedM.end();
}
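A self-contained variant of that loop might look like the sketch below. It is not the code that was timed: it swaps the `resultStr +=` concatenation for a StringBuilder, compiles the three patterns once, and also processes the text after the last protected match, which the loop above does not show. `PROTECTED` is again a hypothetical stand-in for the real protecting regex, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnePassTokenizer {
    // Hypothetical stand-in for the long protecting regex.
    static final Pattern PROTECTED = Pattern.compile("\\p{N}+\\.\\p{N}+");
    // The three replacement patterns from the question, compiled once up front.
    static final Pattern P1 = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");
    static final Pattern P2 = Pattern.compile("([\\p{N}\\p{L}])'(\\p{L})");
    static final Pattern P3 = Pattern.compile("([^\\p{L}])'([^\\p{L}])");

    // Applies the punctuation and apostrophe rules to one unprotected chunk.
    static String separate(String chunk) {
        chunk = P1.matcher(chunk).replaceAll(" $1 ");
        chunk = P2.matcher(chunk).replaceAll("$1 '$2");
        chunk = P3.matcher(chunk).replaceAll("$1 ' $2");
        return chunk;
    }

    static String tokenize(String s) {
        StringBuilder result = new StringBuilder(s.length() * 2);
        Matcher protectedM = PROTECTED.matcher(s);
        int currIndex = 0;
        while (protectedM.find()) {
            // Unprotected text before this match gets tokenized...
            result.append(separate(s.substring(currIndex, protectedM.start())));
            // ...and the protected match is copied through untouched.
            result.append(protectedM.group());
            currIndex = protectedM.end();
        }
        // Do not forget the text after the last protected match.
        result.append(separate(s.substring(currIndex)));
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x=3.14,y")); // prints "x = 3.14 , y"
    }
}
```

One caveat with this approach: because each chunk is matched on its own, the apostrophe patterns cannot see characters on the other side of a protected match, so results at chunk boundaries may differ from the placeholder version.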

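Speed comparison for 100,000 lines of text:

Original Perl script: 272.960579875 seconds
My first attempt: took too long to finish
With appendReplacement(): 14.245160866 seconds
Replacing as protected matches are found: 68.691842962 seconds

Thank you, Java, for not letting me down.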

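I don't know how big your in-between strings are, but I suspect that you can do better than using Matcher.replaceAll, speed-wise.

You're doing 3 passes across the string, one per replaceAll call. A single pass over the characters could look like this: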

void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
  // Pass the whole input string and the bounds, rather than taking a substring.

  // Allocate roughly enough space up-front.
  resultStr.ensureCapacity(resultStr.length() + end - start);

  for (int i = start; i < end; ++i) {
    char c = s.charAt(i);

    // Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
    if (!(Character.isLetter(c)
          || Character.isDigit(c)
          || Character.getType(c) == Character.NON_SPACING_MARK
          || "_-<>'".indexOf(c) != -1)) {  // no backslash here: "\\-" in the regex is just an escaped '-'
      resultStr.append(' ');
      resultStr.append(c);
      resultStr.append(' ');
    } else if (c == '\'' && i > 0 && i + 1 < s.length()) {
      // We have a quote that's not at the beginning or end.
      // Call these 3 characters bcd, where c is the quote.

      char b = s.charAt(i - 1);
      char d = s.charAt(i + 1);

      if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
        // If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
        resultStr.append(' ');
        resultStr.append(c);
      } else if (!Character.isLetter(b) && !Character.isLetter(d)) {
        // If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
        resultStr.append(' ');
        resultStr.append(c);
        resultStr.append(' ');
      } else {
        resultStr.append(c);
      }
    } else {
      // Everything else, just append.
      resultStr.append(c);
    }
  }
}
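To show how a function like this could be driven, here is a hedged sketch that calls it for the text between protected matches and once more for the tail. The class name, the `tokenize` method and the `PROTECTED` pattern are assumptions made for the example, and the copy of `appendInBetween` is condensed but follows the same branches as above. Note that Character.isDigit only covers decimal digits, so unlike `\p{N}` it does not treat letter-like or other numeric characters (for example Roman numeral code points or vulgar fractions) as numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharScanTokenizer {
    // Hypothetical stand-in for the long protecting regex.
    static final Pattern PROTECTED = Pattern.compile("\\p{N}+\\.\\p{N}+");

    static String tokenize(String s) {
        StringBuilder result = new StringBuilder(s.length() * 2);
        Matcher protectedM = PROTECTED.matcher(s);
        int currIndex = 0;
        while (protectedM.find()) {
            appendInBetween(result, s, currIndex, protectedM.start());
            // Copy the protected span straight from the input, no substring needed.
            result.append(s, protectedM.start(), protectedM.end());
            currIndex = protectedM.end();
        }
        appendInBetween(result, s, currIndex, s.length()); // text after the last match
        return result.toString();
    }

    // Condensed copy of the single-pass scan: one loop instead of three regex passes.
    static void appendInBetween(StringBuilder out, String s, int start, int end) {
        out.ensureCapacity(out.length() + end - start);
        for (int i = start; i < end; ++i) {
            char c = s.charAt(i);
            boolean keep = Character.isLetter(c) || Character.isDigit(c)
                    || Character.getType(c) == Character.NON_SPACING_MARK
                    || "_-<>'".indexOf(c) != -1;
            if (!keep) {
                // Punctuation (and whitespace): pad with spaces on both sides.
                out.append(' ').append(c).append(' ');
            } else if (c == '\'' && i > 0 && i + 1 < s.length()) {
                char b = s.charAt(i - 1);
                char d = s.charAt(i + 1);
                if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
                    out.append(' ').append(c);             // like "$1 '$2"
                } else if (!Character.isLetter(b) && !Character.isLetter(d)) {
                    out.append(' ').append(c).append(' '); // like "$1 ' $2"
                } else {
                    out.append(c);
                }
            } else {
                out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x=3.14,y")); // prints "x = 3.14 , y"
    }
}
```

Passing the whole string plus bounds means no intermediate substrings or Matcher instances are created for the unprotected text, which is the point of the suggestion above.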