JavaScript: Avoiding catastrophic backtracking in HTML markup

javascript regex parsing


As the title says, my dataset is markup, and it looks somewhat like this:

<!DOCTYPE html>
<html>
<head>
    <title>page</title>
</head>
<body>
<main>

<div class="menu">
    <img src=mmayboy.jpg>
    <p> stackoverflow is good </p>
</div>

<div class="combine">
    <p> i have suffered <span>7</span></p>
</div>
</main>
</body>
</html>

My regex tries to dig into that markup and grab the block of nodes I need. That's all. As for the internals, here we go:

/
(
 <div class="menu"> // match text that begins with these literals
  (
   \s+.*
  )+ /* match any white space or character after previous. But the problem is that this matches up till the closing tag of other DIVs i.e greedy. */
  <\/div> // stop at the next closing DIV (this catches the last DIV)
  (?: // begin non-capturing group 
   (?=
    (
     \s+<div
     ) /* I'm using the positive lookahead to make sure previous match is not followed by a space and a new DIV tag. This is where the catastrophic backtracking is raised. */
   )
  )
 )
/

(\s+.*)+ can be simplified to
[^]*?

This prevents the catastrophic backtracking. Overall simplification:
/<div class="menu">[^]*?<\/div>/
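
To make the simplification concrete, here is a minimal sketch that runs the lazy pattern against a shortened copy of the markup from the question. The variable names (data, menuPattern, menuBlock) are made up for illustration and are not part of the original answer.

```javascript
// Shortened copy of the markup from the question.
const data = [
  '<main>',
  '<div class="menu">',
  '    <img src=mmayboy.jpg>',
  '    <p> stackoverflow is good </p>',
  '</div>',
  '',
  '<div class="combine">',
  '    <p> i have suffered <span>7</span></p>',
  '</div>',
  '</main>'
].join('\n');

// [^] is an empty negated class: in JavaScript it matches any character,
// including newlines. The lazy *? stops at the first </div> it can reach,
// so there is no nested quantifier left to backtrack through.
const menuPattern = /<div class="menu">[^]*?<\/div>/;

const match = data.match(menuPattern);
const menuBlock = match ? match[0] : null;

console.log(menuBlock);
```

Because the quantifier is lazy, the match ends at the closing tag of the menu block and never reaches the "combine" block.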

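Comments:

For balance: … optimal for small HTML parsing problems, pessimal for large ones.

Yes, use an HTML parser.

The "simplified" regex doesn't work, because there are a lot of whitespace characters before the alphanumerics and symbols. Secondly, DOMParser is still experimental and therefore not yet available on nodejs (which happens to be where I need it).

@Mmayboy: Ah, sorry, the regex has been updated. But use … then. (That's not why it isn't available in Node.)

You should update it again from [^]*? to [^]*? :p Do I really need a Node parser module when this tiny DOM element is all I want? Better still, would you mind explaining how it works?

@Mmayboy: Er, which part needs updating again? Yes, if you want reliable HTML parsing, you should use an HTML parser. If another div appears inside the one you are matching, for example, you'll find it's impossible to use a regular expression to extract the single element.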
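
The nesting caveat from the last comment can be seen in a small sketch. The nested markup below is invented for illustration; it is not taken from the question.

```javascript
// Hypothetical markup: a div nested inside the menu div.
const nested =
  '<div class="menu"><div class="inner">a</div><p>b</p></div>';

const pattern = /<div class="menu">[^]*?<\/div>/;

// The lazy match stops at the FIRST </div>, which closes the inner div,
// so the rest of the menu element (<p>b</p>) is cut off.
const truncated = nested.match(pattern)[0];

console.log(truncated);
```

The regex has no way to count opening and closing tags, which is why a real HTML parser is the reliable option once nesting is possible.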

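// `data` is the markup string shown in the question.
var parser = new DOMParser();
var doc = parser.parseFromString(data, 'text/html');

// Take the first element carrying the "menu" class.
var menu = doc.getElementsByClassName('menu')[0];

console.log(menu.innerHTML);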