Deciphering a regular expression in .NET


Could someone please help me understand this regular expression? It is used to match the src attribute of img tags in HTML:

src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))


src=                               this is easy
(?:(['""])(?<src>(?:(?!\1).)*)     ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1                                 unknown
|                                  "or"
(?<src>[^\s>]+))                   named group "src" matches one or more of line start or whitespace


In short, what does ?: mean?

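So (?:...) is a non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match, nor referenced later in the pattern.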
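To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; the sample string is made up). The (?:...) construct has the same meaning in .NET.

```python
import re

text = "ababc"

# Capturing parentheses: both groups are numbered and retrievable.
capturing = re.search(r"(ab)+(c)", text)

# Non-capturing (?:...): still groups "ab" for the + quantifier,
# but gets no group number, so (c) becomes group 1.
non_capturing = re.search(r"(?:ab)+(c)", text)

print(capturing.groups())      # ('ab', 'c')
print(non_capturing.groups())  # ('c',)
```

Both patterns match the same text; only the numbering of the retrievable groups changes.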

Thanks @embratch

What does \1 mean?

Finally, does the exclamation mark have any special meaning here? (Negation?)

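This may help you understand the regex:

(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))

1> It first captures either ' or " in group 1, i.e. (['""])

2> It then matches zero or more characters that are not the character captured in group 1, i.e. (?:(?!\1).)*

3> It repeats step 2 until it reaches the character captured in group 1, i.e. \1

The three steps above are conceptually similar to (['""])[^\1]*\1 (conceptually only: a backreference cannot be used inside a character class, which is why the lookahead is needed).

OR

1> It matches all characters after src= that are neither whitespace nor >, i.e. [^\s>]+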
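The two alternatives can be observed by checking which groups take part in a match. This is only a sketch in Python with made-up input tags; (['""]) is written as (['"]), which matches the same two characters.

```python
import re

# Unnamed-group version of the pattern, with the src= prefix added.
pattern = re.compile(r"""src=(?:(['"])((?:(?!\1).)*)\1|([^\s>]+))""")

quoted = pattern.search('<img src="a b.png" alt="x">')
unquoted = pattern.search("<img src=logo.png>")

# Quoted branch: group 1 is the quote, group 2 the value, group 3 unused.
print(quoted.groups())    # ('"', 'a b.png', None)

# Unquoted branch: only group 3 takes part in the match.
print(unquoted.groups())  # (None, None, 'logo.png')
```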

Note: I would use

src=(['""]).*?\1

.* is greedy: it matches as much as possible.

.*? is lazy: it matches as little as possible.

For example, consider the string hello hi world

For the regex

^h.*l

the output would be

hello hi worl

For the regex

^h.*?l

the output would be

hel
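The same comparison can be reproduced with a short Python sketch. The last line tries the lazy src= variant suggested in this answer on a made-up tag, with the value wrapped in a capturing group (an addition for illustration) so it can be read back.

```python
import re

text = "hello hi world"

# Greedy: .* grabs everything, then backs off to the last "l".
greedy = re.match(r"^h.*l", text).group()

# Lazy: .*? grows one character at a time and stops at the first "l".
lazy = re.match(r"^h.*?l", text).group()

print(greedy)  # hello hi worl
print(lazy)    # hel

# Lazy quantifier applied to the attribute: stops at the first closing quote.
src = re.search(r"""src=(['"])(.*?)\1""", '<img src="a.png" alt="x">').group(2)
print(src)     # a.png
```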

I got this output using RegexBuddy:

Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
   Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
      Match the regular expression below and capture its match into backreference number 1 «(['""])»
         Match a single character present in the list “'"” «['""]»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
         Match the regular expression below «(?:(?!\1).)*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
               Match the same text as most recently matched by capturing group number 1 «\1»
            Match any single character that is not a line break character «.»
      Match the same text as most recently matched by capturing group number 1 «\1»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
         Match a single character NOT present in the list below «[^\s>]+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
            A whitespace character (spaces, tabs, line breaks, etc.) «\s»
            The character “>” «>»


This regular expression is a very poor fit for what you describe. src="" is a valid input.

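For example, consider src="img.jpg" as the text being parsed.

In a regular expression, \1 refers to the first capturing group. In this particular case, the first capturing group is (['""]). In our example, the (?:(['""])(?<src>(?:(?!\1).)*) portion is the start of a non-capturing group, and it matches "img.jpg.

In particular, (['""]) matches either quote character. Then (?!\1) is a negative lookahead for the quote character matched by the first group, so (?:(?!\1).) matches any single character that is not that quote character, and (?<src>(?:(?!\1).)*) matches, in a named capturing group, the sequence of characters up to the closing quote character. The \1 that follows then matches the closing quote character.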
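The role of the lookahead in this walkthrough can be checked with a small Python sketch. The input text and variable names are made up for illustration, and the named group is left unnamed here.

```python
import re

text = '"img.jpg" alt="logo"'

# Opening quote in group 1, then characters that are not a copy of it.
tempered = re.compile(r"""(['"])((?:(?!\1).)*)""")
m = tempered.match(text)

opening_quote = m.group(1)   # the quote that opened the value
value = m.group(2)           # characters consumed while the lookahead passed
stopped_at = text[m.end()]   # the character at which (?!\1) refused to go on

# A lookahead is zero-width: it inspects the next character without consuming it.
zero_width_end = re.match(r'(?!")', "img").end()
blocked = re.match(r'(?!")', '"img')
```

Here value is img.jpg and stopped_at is the closing quote, which the trailing \1 of the full pattern would then consume.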

src=      # matches literal "src="
(?:       # the ?: suppresses capturing. generally a good practice if capturing
          # is not explicitly necessary
  (['"])  # matches either ' or ", and captures what was matched in group 1
          # (because this is the first set of parentheses where capturing is not
          # suppressed)
  (?<src> # start another (named) capturing group with the name "src"
    (?:   # start non-capturing group
      (?!\1)
          # a negative lookahead, if its contents match, the lookahead causes the
          # pattern to fail
          # the \1 is a backreference and matches what was matched in capturing
          # group no. 1
    .)*   # match any character, end of non-capturing group, repeat
          # summary of this non-capturing group: for each character, check that
          # it is not the kind of quote we matched at the start. if it's not,
          # then consume it. repeat as long as possible.

  )       # end of capturing group "src"
  \1      # again a backreference to what was matched inside capturing group 1
          # i.e. match the same kind of quote that started the attribute value
|         # or
  (?<src> # again a capturing group with the name "src"
    [^\s>]+
          # match as many non-space, non-> character as possible (at least one)
  )       # end of capturing group. this case treats unquoted attribute values.
)         # end of non-capturing group (which was used to group the alternation)
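A commented layout like the one above can be kept in real code by using a free-spacing mode. The sketch below uses Python's re.VERBOSE with made-up HTML; because Python's re module does not allow the same group name twice, the two src groups are left unnamed and read by number. In .NET, RegexOptions.IgnorePatternWhitespace should permit a similar layout while keeping the named group, readable through match.Groups["src"].Value.

```python
import re

SRC_PATTERN = re.compile(r"""
    src=                    # literal text src=
    (?:                     # non-capturing group around the alternation
        (['"])              # group 1: opening quote
        ( (?: (?!\1) . )* ) # group 2: characters up to the same quote
        \1                  # closing quote, same kind as the opening one
      |
        ( [^\s>]+ )         # group 3: unquoted value
    )
""", re.VERBOSE)

html = """<img src="a.png"> <img src='b c.jpg' alt="x"> <img src=d.gif>"""

# Take the quoted value when that branch matched, otherwise the unquoted one.
sources = [
    m.group(2) if m.group(2) is not None else m.group(3)
    for m in SRC_PATTERN.finditer(html)
]
print(sources)  # ['a.png', 'b c.jpg', 'd.gif']
```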
