JavaScript replace() regex too greedy

javascript regex

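I'm trying to clean up an HTML input field. I want to keep some of the tags, but not all of them, so I can't just use .text() when reading the element's value. I'm having some trouble with a JavaScript regex in Safari. Here's the code snippet (I copied this bit of regex from an answer in another SO thread):

aString.replace(/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, "$2 (Link->$1)");

However, the regex grabs everything up to the second <a> tag on the first match, so I lose the first line of output. (In fact, it keeps grabbing all the way down the list for as long as the anchor elements are adjacent.) The input is one long string, not split into lines by CR/LF or anything else.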

I've tried using the non-greedy modifier like this (note the second question mark):

/<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/ig

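But that doesn't seem to change anything (at least not in the few online testers/parsers I tried it in). I also tried the /U flag, but it didn't help (or those parsers didn't recognize it).

Any suggestions?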

Use

href="[^"]+"

instead of

href=\"(.*?)\"

Basically, this grabs every character up to the next double quote (") and can never run past it.
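A minimal sketch of the difference, on a made-up input string (the sample HTML, the trailing ` class=` part of the patterns, and the variable names are assumptions for illustration; the capture group around `[^"]+` is also added here so the URL can be read back). A lazy `.*?` prefers short matches but is still allowed to cross a quote if that is the only way the rest of the pattern can match, whereas `[^"]+` cannot:

```javascript
// Hypothetical input: the first link has no class attribute, the second one does.
const input = '<a href="/one" id="x">One</a><a href="/two" class="ext">Two</a>';

// Lazy dot: ".*?" prefers short matches but may still run past a quote
// when that is the only way for the rest of the pattern to match.
const lazyUrl = input.match(/href="(.*?)" class=/)[1];

// Negated class: [^"]+ can never cross a double quote.
const negatedUrl = input.match(/href="([^"]+)" class=/)[1];

console.log(lazyUrl);    // /one" id="x">One</a><a href="/two
console.log(negatedUrl); // /two
```

The negated class also tends to backtrack less, since the engine never has to test whether a longer match through a quote might succeed.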

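That said, it might be easier to implement something like Markdown syntax. Then you don't have to worry about stripping out the wrong tags: just strip all tags, and replace the Markdown with HTML tags when you display the text.

For example, on SO you can make a link by using

[link text](http://linkurl.com)

and the regex for the replacement is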

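var displayText = "This is just some text [and this is a link](http://example.com) and then more text";
// [text](url): group 1 is the link text, group 2 is the URL; "g" converts every link, not just the first
var linkMarkdown = /\[([^\]]+)\]\(([^)]+)\)/g;
// replace() returns a new string and leaves displayText unchanged, so keep the result
var html = displayText.replace(linkMarkdown, '<a href="$2">$1</a>');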

Or use an existing library to do the conversion.
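As a rough sketch of the "strip everything, then render the Markdown links" idea, something like the following could work. The helper names `stripTags` and `renderLinks` and the sample text are hypothetical, and the tag-stripping regex is deliberately naive: it assumes trusted, internal input and is not a security filter.

```javascript
// Hypothetical helper: remove anything that looks like an HTML tag.
// Naive on purpose; assumes trusted input, not a sanitizer.
function stripTags(text) {
  return text.replace(/<[^>]*>/g, '');
}

// Hypothetical helper: strip tags first, then turn [text](url) into anchors.
function renderLinks(text) {
  return stripTags(text).replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>');
}

const stored = 'See <b>the docs</b> at [example](http://example.com) and [more](http://example.com/more).';
const rendered = renderLinks(stored);
console.log(rendered);
// See the docs at <a href="http://example.com">example</a> and <a href="http://example.com/more">more</a>.
```

A URL that contains a double quote would break the generated attribute, which is one reason a ready-made library may be the safer choice.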

There are several mistakes and possible improvements in your pattern:

/<
\s*    # not needed (browsers don't recognize "< a" as an "a" tag)

a      # to avoid confusing an "a" tag with the start of an "abbr" tag, add a
       # word boundary or, better, "\s+", since at least one whitespace
       # character follows the tag name.

.      # the dot matches everything except newlines, so the pattern fails if an
       # "a" tag spans several lines. Unless you can rely on the ES2018 "s"
       # (dotAll) flag, replace it with `[\s\S]`, which matches any character
       # (everything that is whitespace + everything that is not)

*      # quantifiers are greedy by default. ".*" matches up to the end of the
       # line, and "[\s\S]*" matches up to the end of the string! The regex
       # engine then has to backtrack a lot until it finds the last "href"
       # (which is not always the one you want)

href=  # you can add a word boundary before the "h" and allow optional spaces
       # around the equals sign to make the pattern more robust: \bhref\s*=\s*

\"     # the quote doesn't need to be escaped. Also, as Markasoftware notes, an
       # attribute value is not always between double quotes: it can have
       # single quotes or no quotes at all. (1)
(.*?)
\"     # same thing
.*     # same thing: matches everything up to the last >
>(.*?)<\/a>/gi
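To illustrate the point about the dot and newlines, here is a small sketch on a made-up tag that is split across two lines (the input and variable names are assumptions; the `s` flag requires an engine with ES2018 support):

```javascript
// Hypothetical input: one link whose opening tag spans two lines.
const multiline = '<a\n  href="http://example.com">text</a>';

// "." cannot cross the newline, so this pattern finds nothing.
const dotMatches = /<a.*?href="([^"]+)"/.test(multiline);

// [\s\S] matches any character, including newlines.
const anyCharUrl = /<a[\s\S]*?href="([^"]+)"/.exec(multiline)[1];

// With the ES2018 "s" (dotAll) flag, "." matches newlines too.
const dotAllUrl = /<a.*?href="([^"]+)"/s.exec(multiline)[1];

console.log(dotMatches); // false
console.log(anyCharUrl); // http://example.com
console.log(dotAllUrl);  // http://example.com
```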

Details for (1):

\bhref\s*=\s*
(["']?)     # capture group 1: a single quote, a double quote, or nothing
([^"'\s>]*) # capture group 2: everything that is not a quote (to stop before a
            # possible closing quote), not whitespace (URLs don't contain
            # spaces, although JavaScript code in an href can), and not ">"
            # (to stop before the end of the tag when no quotes are used)
\1          # backreference to capture group 1
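A quick way to check this sub-pattern is to run it against the three quoting styles. The sample tags and variable names below are made up for illustration:

```javascript
// The sub-pattern above, written on one line.
const hrefValue = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical sample tags: double quotes, single quotes, no quotes.
const samples = [
  '<a href="http://example.com/a">',
  "<a href='http://example.com/b'>",
  '<a href=http://example.com/c>',
];

// Group 1 is the quote character (possibly empty); group 2 is the value.
const values = samples.map((tag) => hrefValue.exec(tag)[2]);
console.log(values);
// [ 'http://example.com/a', 'http://example.com/b', 'http://example.com/c' ]
```

One limitation worth keeping in mind: because group 2 excludes both kinds of quote, a value such as href="it's" is not captured whole.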

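Note that this sub-pattern adds a capture group: the URL is now in capture group 2, and the content between the <a> tags is now in capture group 3. Adjust the replacement string accordingly: $1 becomes $2, and $2 becomes $3.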

To sum up, you can write your pattern like this:

aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
               '$3 (Link->$2)');
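Here is a usage sketch of the full pattern on a made-up pasted fragment. The function name `flattenLinks` and the sample HTML are assumptions. The replacement string uses `$2` for the URL, because group 1 only holds the optional quote character:

```javascript
// Full pattern from the answer: group 1 = quote, group 2 = URL, group 3 = link text.
const linkPattern = /<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi;

// Hypothetical helper: replace each link with "text (Link->url)".
function flattenLinks(html) {
  return html.replace(linkPattern, '$3 (Link->$2)');
}

// Hypothetical input: two adjacent links, with different quoting and attribute order.
const pasted =
  '<p><a href="http://example.com/one">Go here</a></p>' +
  "<p><a class=ext href='http://example.com/two'>or here</a></p>";

const flattened = flattenLinks(pasted);
console.log(flattened);
// <p>Go here (Link->http://example.com/one)</p><p>or here (Link->http://example.com/two)</p>
```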

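Thanks, everyone, for the suggestions; they helped a lot and gave me plenty of ideas for improvements.

But I think I found the specific reason the original regex was failing. Casimir's answer touches on it, but I didn't understand it until I stumbled on the fix myself.

I had been looking for the problem in the wrong place, here:

/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi
                       ^

when the fix was here:

/<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi
        ^

I do plan to use the other suggestions here to improve my statement further.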

--C
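To see where the greedy quantifiers bite, here is a small sketch that runs three variants of the pattern over a made-up input with two adjacent links (the sample string and variable names are assumptions, not the data from the question). On this sample, making only the first `.*` lazy is not enough: the later greedy `.*>` still runs on to the last `>` it can use, so the third variant makes that one lazy as well.

```javascript
// Hypothetical input: two adjacent links in one long string.
const sample =
  '<p><a href="http://example.com/one">First</a></p>' +
  '<p><a href="http://example.com/two">Second</a></p>';
const replacement = '$2 (Link->$1)';

// Original pattern: both ".*" are greedy.
const greedy = sample.replace(/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, replacement);

// Only the first ".*" made lazy.
const firstLazy = sample.replace(/<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi, replacement);

// Both ".*" made lazy.
const bothLazy = sample.replace(/<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi, replacement);

console.log(greedy);    // <p>Second (Link->http://example.com/two)</p>
console.log(firstLazy); // <p>Second (Link->http://example.com/one)</p>
console.log(bothLazy);
// <p>First (Link->http://example.com/one)</p><p>Second (Link->http://example.com/two)</p>
```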

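Just so you know, your regex is not enough to protect against <a> tags: they can use single quotes or no quotes at all in the href attribute, or they can use an inline onclick or other event handlers.

Luckily, this isn't a public system, so I'm not worried about security. This is just an attempt to remove formatting from text that gets pasted into this field. (It's an internal comments system: people enter comments on a parent record into the database.)

href="[^"]++" would be more optimized than href="[^"]+", because it won't try to backtrack into impossible matches when the match fails. But Markasoftware has a point: this regex may not be protective enough.

@Robin: JavaScript regex has neither possessive quantifiers nor atomic groups. However, you can emulate an atomic group with this trick (it is the same as (?>a+) or a++):

"(?=([^"]+))\1"

because lookaheads are atomic.

I'll give the suggested change a try. I think Markdown is a bit much for this use. This is an internal system for tracking comments; I just need to strip most of the formatting HTML that comes from people copying and pasting information. I'm trying to be nice and pull the URL out of any links they copy.

@Patrick: That doesn't seem to work for me with your suggested edit (at least in reFiddle's tester/parser).

Ah, yes, it does seem to work. Thanks for the fiddle. I had used it in reFiddle too. (It must have been a glitch in reFiddle not refreshing correctly; it only worked there after I reloaded the whole page.)

Wow, awesome! I appreciate the thoroughness. I'll go through these and make some changes.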
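The lookahead trick from the comments can be checked with a short sketch. The test strings and variable names are made up. The idea is that once the lookahead has captured its text, the engine cannot go back inside it to try a shorter capture, so `\1` consumes exactly what was captured, much as an atomic group would:

```javascript
// Plain greedy quantifier: "a+" gives back one "a" so that "ab" can match.
const plain = /a+ab/.test('aaab');

// Emulated atomic group: the lookahead captures all the "a"s at that position,
// \1 consumes them, and nothing can be given back, so "ab" never matches.
const atomic = /(?=(a+))\1ab/.test('aaab');

// The quoted-value form from the comment still matches ordinary input.
// Note that the emulation adds a capture group of its own.
const quoted = /"(?=([^"]+))\1"/.exec('say "abc" now');

console.log(plain);     // true
console.log(atomic);    // false
console.log(quoted[1]); // abc
```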