.NET: deciphering a regular expression
Could someone please help me understand this regular expression? It is used to match the src attribute of img tags in HTML.
src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))
src= this is easy
(?:(['""])(?<src>(?:(?!\1).)*) ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1 unknown
| "or"
(?<src>[^\s>]+)) named group "src" matches one or more of line start or whitespace
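In short, what does ?: mean?
So (?:…) is the non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after the match or referenced later in the pattern.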
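To make the difference concrete, here is a minimal sketch using Python's re module. That choice is an assumption made for illustration: the thread is about .NET, but (?:…) groups without capturing in both flavours. The patterns and the sample text are made up.

```python
import re

# The same sub-pattern twice: once with capturing parentheses,
# once with the non-capturing form (?:...).
capturing = re.compile(r"(ab)+(c)")
non_capturing = re.compile(r"(?:ab)+(c)")

text = "ababc"

m1 = capturing.match(text)
m2 = non_capturing.match(text)

# Both patterns match the same overall text.
whole_1 = m1.group(0)
whole_2 = m2.group(0)

# With capturing parentheses, "ab" occupies group 1 and "c" is pushed to group 2.
groups_1 = m1.groups()
# With (?:...), the parentheses only group; "c" becomes group 1.
groups_2 = m2.groups()

print(whole_1, groups_1)
print(whole_2, groups_2)
```

In the pattern from the question, the outer (?:…) exists only to hold the alternation together, which is why the quote character ends up in group 1.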
Thanks @embratch
What does \1 mean?
Finally, does the exclamation mark have any special meaning here? (Negation?)
This might help you understand the regex:
(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))
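1> It first captures either one of ['""] in group 1, i.e. (['""])
2> It then matches 0 or more characters that are not the one captured in group 1, i.e. (?:(?!\1).)*
3> It repeats step 2 until it reaches what was captured in group 1, i.e. \1
The 3 steps above are conceptually similar to (['""])[^\1]*\1, although that form would not work as written because a backreference cannot be used inside a character class.
OR
1> It matches all characters other than whitespace and > that follow src=, i.e. [^\s>]+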
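The three steps of the quoted branch can be sketched as follows. This uses Python's re module rather than .NET (an assumption for illustration; the backreference and lookahead behave the same way here), and the sample tags are invented.

```python
import re

# Quoted branch only: opening quote, run of non-quote characters, same closing quote.
# Group 1 = the quote character, group 2 = the attribute value.
quoted = re.compile(r"""src=(['"])((?:(?!\1).)*)\1""")

# A value in double quotes may contain single quotes, and vice versa,
# because only the quote captured in group 1 ends the value.
m_double = quoted.search("""<img src="it's.png" alt="x">""")
m_single = quoted.search("""<img src='say "hi".png'>""")

value_double = m_double.group(2)
value_single = m_single.group(2)

# Mismatched quotes do not match: \1 must repeat the opening quote exactly.
m_mismatch = quoted.search("""<img src="broken.png'>""")

print(value_double)
print(value_single)
print(m_mismatch)
```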
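NOTE: I would use
src=(['""]).*?\1
* is greedy; it matches as much as possible.
*? is lazy; it matches as little as possible.
For example, consider the string hello hi world.
For the regex ^h.*l the output would be hello hi worl.
For the regex ^h.*?l the output would be hel.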
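The greedy and lazy results above can be checked with a short sketch. It is written for Python's re module, which is an assumption on my part; the quantifiers behave the same way in .NET. The img tag is an invented sample.

```python
import re

text = "hello hi world"

# .* runs to the end of the string, then backs off to the last "l".
greedy = re.match(r"^h.*l", text).group(0)
# .*? stops at the first "l" it can reach.
lazy = re.match(r"^h.*?l", text).group(0)

# The same difference applied to a quoted attribute value.
tag = """<img src="a.png" alt="b">"""
greedy_attr = re.search(r"""src=(['"]).*\1""", tag).group(0)
lazy_attr = re.search(r"""src=(['"]).*?\1""", tag).group(0)

print(greedy)
print(lazy)
print(greedy_attr)
print(lazy_attr)
```

With the greedy form the match runs on to the last matching quote in the tag, which is why the lazy form is the one suggested for attribute values.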
I got this output using RegexBuddy:
Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
Match the regular expression below and capture its match into backreference number 1 «(['""])»
Match a single character present in the list “'"” «['""]»
Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
Match the regular expression below «(?:(?!\1).)*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
Match the same text as most recently matched by capturing group number 1 «\1»
Match any single character that is not a line break character «.»
Match the same text as most recently matched by capturing group number 1 «\1»
Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
Match a single character NOT present in the list below «[^\s>]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A whitespace character (spaces, tabs, line breaks, etc.) «\s»
The character “>” «>»
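This regular expression is a very poor fit for what you describe. src="" is a valid input.

For example, consider src="img.jpg" as the text being parsed.
In the regular expression, \1 refers to the first capturing group. In this particular case, the first capturing group is (['""]). In our example, the section (?:(['""])(?<src>(?:(?!\1).)*) is the start of a non-capturing group, and it matches "img.jpg. In particular, (['""]) matches either quote character. Then (?!\1) is a negative lookahead for the quote character matched by the first group, so (?:(?!\1).) matches any character that is not that quote character, and (?<src>(?:(?!\1).)*) captures, in the named group src, the sequence of characters before the closing quote. The following \1 then matches the closing quote character.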
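The walk-through for src="img.jpg" can be replayed by hand. The sketch below assumes Python's re module, which writes named groups as (?P<src>…) and does not accept two groups with the same name, so only the quoted branch is kept. The loop is an illustrative stand-in for what (?:(?!\1).)* does, not how the engine is implemented.

```python
import re

text = 'src="img.jpg"'

# Quoted branch only, with the value in a group named "src".
pattern = re.compile(r"""src=(['"])(?P<src>(?:(?!\1).)*)\1""")
m = pattern.match(text)

quote = m.group(1)       # the opening quote captured by group 1
value = m.group("src")   # everything up to, but not including, the closing quote

# Replay the (?!\1). step by hand: a character is consumed only if it is not the quote.
start = text.index(quote) + 1
consumed = []
for ch in text[start:]:
    if ch == quote:      # the lookahead (?!\1) fails here, so the repetition stops
        break
    consumed.append(ch)
manual_value = "".join(consumed)

print(quote, value, manual_value)
```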
src= # matches literal "src="
(?: # the ?: suppresses capturing. generally a good practice if capturing
# is not explicitly necessary
(['"]) # matches either ' or ", and captures what was matched in group 1
# (because this is the first set of parentheses where capturing is not
# suppressed)
(?<src> # start another (named) capturing group with the name "src"
(?: # start non-capturing group
(?!\1)
# a negative lookahead, if its contents match, the lookahead causes the
# pattern to fail
# the \1 is a backreference and matches what was matched in capturing
# group no. 1
.)* # match any character, end of non-capturing group, repeat
# summary of this non-capturing group: for each character, check that
# it is not the kind of quote we matched at the start. if it's not,
# then consume it. repeat as long as possible.
) # end of capturing group "src"
\1 # again a backreference to what was matched inside capturing group 1
# i.e. match the same kind of quote that started the attribute value
| # or
(?<src> # again a capturing group with the name "src"
[^\s>]+
# match as many non-space, non-> character as possible (at least one)
) # end of capturing group. this case treats unquoted attribute values.
) # end of non-capturing group (which was used to group the alternation)
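A commented pattern like the one above can be run directly when the engine supports a free-spacing mode. The sketch below uses Python's re.VERBOSE as an assumption for illustration; in .NET the corresponding option is RegexOptions.IgnorePatternWhitespace. Python rejects duplicate group names, so the two branches use the hypothetical names quoted and unquoted instead of a single src, and the helper extract_src and the sample HTML are invented here.

```python
import re

# Assumption: Python's re module. Duplicate group names are not allowed,
# so each branch of the alternation gets its own (hypothetical) name.
SRC_PATTERN = re.compile(r"""
    src=                      # literal "src="
    (?:                       # non-capturing group around the alternation
        (['"])                # group 1: the opening quote
        (?P<quoted>           # value of a quoted attribute
            (?:(?!\1).)*      # any characters that are not the opening quote
        )
        \1                    # the same quote again
      |                       # or
        (?P<unquoted>         # value of an unquoted attribute
            [^\s>]+           # one or more characters, none whitespace or ">"
        )
    )
""", re.VERBOSE)


def extract_src(html):
    """Return every src value found in html, quoted or not."""
    values = []
    for m in SRC_PATTERN.finditer(html):
        # Exactly one of the two named groups takes part in each match.
        quoted = m.group("quoted")
        values.append(quoted if quoted is not None else m.group("unquoted"))
    return values


sample = """<img src="a.png"><img src='b c.png' alt=x><img src=d.png>"""
found = extract_src(sample)
print(found)
```

The check against None rather than truthiness matters because an empty quoted value, src="", still matches the quoted branch.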