Regular expression engine fails on large input (Java, Python, Node.js, V8)


java python regex node.js


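This question is a bit involved, and Googling did not really help. I will try to include only the relevant aspects of it.

I have a large document in roughly the following format:

Sample input: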

ABC is a word from one line of this document. It is followed by
some random line
PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
Here GHI appears in the middle.
This may be yet another line.
VWX is a line
this is the last line


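I am trying to remove sections of text according to the following rule:

From any one of:
- ABC
- DEF
- GHI
To any one of (while retaining that word):
- PQR
- STU
- VWX

The words that make up "From" can appear anywhere in a line (look at GHI). For removal, however, the entire line has to go. (The whole line containing GHI needs to be removed, as shown in the sample output below.)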

Sample output:


PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
VWX is a line
this is the last line

The above example seemed simple to me until I ran it on a very large input file (49 KB).

What I have tried:


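The regular expression I am currently using (with the case-insensitive and multiline modifiers):

Issue

The above regexp works wonderfully on small text files, but the engine fails or crashes on large files. I have tried it against the following:

V8 (Node.js): hangs
Rhino: hangs
Python: hangs
Java:
```
StackOverflowError
```
(stack trace posted at the end of this question)
IonMonkey (Firefox): works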

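Actual input:

My original input:

My regular expression (split over multiple lines for clarity):

Questions:

Is my regular expression correct? Can it be optimized further to avoid this problem?

If it is correct, why do the other engines hang indefinitely? Part of the stack trace is shown below:

Stack trace: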

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4218)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)

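PS: I have added several tags to this question because I tried it in those environments and the experiment failed in each.
I would be tempted to try simplifying the RE. To be honest, it is not terribly complex at the moment, but how about: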

\b(abc|def|ghi)\b.*\b(pqr|stu|vwx)\b

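That is not what you are doing now, but do you need the line anchor at the start and the optional elements in the middle? It may make no difference, but it might be worth a try.
I think your problem may be that, as the file gets longer, the number of ways to pair a "from" block with a "to" block grows to roughly n x m / 2. That means you get far more candidate results, and each one spans more and more of the source file. If the file starts with ABC and ends with VWX, then one of the matches is the whole file.
To reduce the matching the regex engine has to do, my first approach would be to run separate regular expressions for
(abc|def|ghi)
and
(pqr|stu|vwx)
Once the results are back, you can walk through each "from" match and look for the first "to" match that follows it. Some pseudocode to achieve this: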

from = regex.match(file, '(abc|def|ghi)')
to = regex.match(file, '(pqr|stu|vwx)')
for each match in from:
    for index in to:
        if index > match:
            add index, match to results
            break
for each result:
    parse backwards to the beginning of the line
    edit the file to remove the matching text
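Although this creates more work for you, it means the regex parser does not have to hold the entire n kB file in memory at once and can work through small chunks more efficiently.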
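The pseudocode above might be fleshed out in Python roughly as follows. This is only a sketch under assumptions: the function name `remove_sections`, the word-boundary patterns, and the decision to leave the text untouched when a "from" word has no later "to" word are illustrative choices, not something stated in the question.

```python
import re
from bisect import bisect_left

# Assumed keyword lists, taken from the question's description.
FROM_RE = re.compile(r"\b(?:abc|def|ghi)\b", re.IGNORECASE)
TO_RE = re.compile(r"\b(?:pqr|stu|vwx)\b", re.IGNORECASE)


def remove_sections(text):
    """Drop text from the start of each 'from' line up to the next 'to' word."""
    to_starts = [m.start() for m in TO_RE.finditer(text)]  # already sorted
    kept = []
    pos = 0  # everything before pos has been copied or deleted already
    for m in FROM_RE.finditer(text):
        if m.start() < pos:
            continue  # this 'from' word sits inside a section already removed
        i = bisect_left(to_starts, m.end())  # first 'to' word after this match
        if i == len(to_starts):
            break  # assumption: no closing word, so leave the rest untouched
        # Walk back to the beginning of the line, but never before pos,
        # so a 'to' word that was just retained is not deleted again.
        line_start = max(text.rfind("\n", 0, m.start()) + 1, pos)
        kept.append(text[pos:line_start])
        pos = to_starts[i]  # resume at the 'to' word, which is retained
    kept.append(text[pos:])
    return "".join(kept)
```

Each of the two patterns is a flat alternation of literals, so neither scan has nested choices to backtrack over, and the pairing step is a binary search per "from" match.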
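The problem is with (.|\s)*: any whitespace character matches both alternatives, so the engine is free to take either one at every such character. That makes the number of paths it may have to try grow exponentially.
You can see the problem with this regular expression in Ruby: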

str = "b" + "a" * 200 + "cbab"
/b(a|a)*b/.match str
This takes a very long time, while the essentially identical

/ba*b/.match str
matches quickly.
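The same effect can be sketched in Python with the `(.|\s)*` construct itself (an engine may simplify two identical literal alternatives such as `(a|a)`, so the overlapping pair from the question is used here). The helper name `time_match` and the subject string are made up for illustration, and the sizes are kept small on purpose: on a backtracking engine the first pattern is expected to slow down sharply as `n` grows, so treat the timings as indicative only.

```python
import re
import time

# '.' and '\s' both match a space, so every space offers two choices.
ALTERNATION = re.compile(r"b(?:.|\s)*b")
# A single class that matches any character: nothing to choose between.
SINGLE_CLASS = re.compile(r"b[\s\S]*b")


def time_match(pattern, n):
    """Return (match result, seconds) for a subject that cannot match."""
    subject = "b" + " " * n + "c"  # no second 'b', so the match must fail
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for n in (8, 12, 16):
        _, slow = time_match(ALTERNATION, n)
        _, fast = time_match(SINGLE_CLASS, n)
        print(f"n={n:2d}  alternation={slow:.4f}s  class={fast:.6f}s")
```

Both patterns accept the same strings; only the amount of work needed to discover a failure differs.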

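You can fix this by using just `.*` or, if `.` does not match newlines, `(.|\n)*`.

The problem may be that regexp engines are implemented differently. There are two main types of RE engine: backtracking-based and NFA-based. NFA-based engines need more memory up front to preprocess the regexp (to build the NFA), while backtracking engines do not. When it comes to matching, however, the picture changes: a backtracking engine can take exponential time on some patterns, whereas an NFA simulation stays linear in the length of the input.

Thanks for your answer. I have `^.*` because I need to remove the entire "From" line. There are no optional elements in the middle; `*?` is for non-greedy matching.

Right, I see. What I meant by "optional" was the "or" between `.` and `\s`. I missed the non-greedy/lazy qualifier.

Oh, OK. That is there because the match for the middle part can span multiple lines (as in the sample input), and it should be non-greedy. That is why I have `(.|\s)*?`. The `.` in a regexp usually does not match a newline. Please correct my analysis if it is wrong.

If possible, always prefer classes or conditions: if you know your text, try `[\w\d.\s\n]*` instead of `(.|\n)*`. The fewer branches, the better.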
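Putting the advice from these answers together, one possible branch-free pattern in Python might look like the sketch below. It is an assumption-laden illustration, not the asker's regex (which is not reproduced here): the DOTALL flag replaces `(.|\s)*?`, `[^\n]*?` stands in for `^.*` so the prefix stays on one line, and the lookahead that retains the "to" word is an added design choice.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

SECTION_RE = re.compile(
    r"^[^\n]*?\b(?:abc|def|ghi)\b"  # the 'from' line, from its start to the keyword
    r".*?"                          # lazily skip anything, newlines included (DOTALL)
    r"(?=\b(?:pqr|stu|vwx)\b)",     # stop just before the 'to' word so it is kept
    FLAGS,
)


def strip_sections(text):
    """Remove every from-line-to-keyword section; the 'to' word itself stays."""
    return SECTION_RE.sub("", text)
```

Because the middle part is a single `.*?` and not an alternation inside a repeated group, each character offers only one way to be consumed.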