Regex to include escaped HTML markup together with another regex

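I have a substring:

&lt;a href="http://www.somesite.com/" target="_blank"&gt;

and I found this regex on the internet to identify the URL part of that string:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]

However, this regex does not cover the escaped HTML text around the URL, that is, &lt;a href=" and " target="_blank"&gt;.

I need to be able to identify the complete string in a large document, so this means writing additional regex for the escaped HTML portions of the string above. What would a regex that finds the whole string look like?

Thanks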

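Regex is probably not a good idea for HTML. However, since using character references as markup delimiters is unusual, this may not be real HTML anyway.

This Perl sample might work, but I'm not sure: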

use strict;
use warnings;

my $samp = '
 &lt;a href="http://www.somesite.com/" target="_blank"&gt;
 <a target="_blank" href="http://www.someothersite.com/" &gt;
';

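my $regex = qr{
(                                   # capture 1: the whole tag
 (?:<|&lt;)a                        # literal or escaped opening delimiter
    (?=\s) (?:(?!&gt;|>)[\S\s])*    # any attributes, never crossing a closing delimiter
    (?<=\s) href \s* = \s*
        " \s* ((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]) \s* "   # capture 2: the URL
    (?:(?!&gt;|>)[\S\s])* (?<!/)    # remaining attributes; reject a tag ending in a slash
 (?:>|&gt;)                         # literal or escaped closing delimiter
)
}x;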


while ($samp =~ /$regex/g) {
    print "In: '$1'\nfound: '$2'\n--------\n";
}

Output:

In: '&lt;a href="http://www.somesite.com/" target="_blank"&gt;'
found: 'http://www.somesite.com/'
--------
In: '<a target="_blank" href="http://www.someothersite.com/" &gt;'
found: 'http://www.someothersite.com/'
--------
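
As a variation on the sample above, here is a minimal sketch that first rewrites the two escaped delimiters (&lt; and &gt;) to their literal forms and then matches with a simpler pattern. This is not part of the original answer: the helper names normalize_delims and extract_hrefs are hypothetical, and the sketch assumes it is acceptable to match against a modified copy of the text rather than the original.

```perl
use strict;
use warnings;

# Hypothetical helper: decode only the two entities that stand in for
# tag delimiters, so a single pattern covers both forms.
sub normalize_delims {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    return $text;
}

# Hypothetical helper: return the href URL of every <a ...> tag in the text.
sub extract_hrefs {
    my ($text) = @_;
    my $plain = normalize_delims($text);
    my @urls;
    while ($plain =~ m{
        <a \s [^>]*?                        # opening tag, then any attributes
        (?<=\s) href \s* = \s* " \s*
        ( (?:https?|ftp|file)://            # same URL pattern as in the question
          [-a-zA-Z0-9+&@#/%?=~_|!:,.;]*
          [-a-zA-Z0-9+&@#/%=~_|] )
        \s* "
        [^>]* >                             # remaining attributes and the close
    }gx) {
        push @urls, $1;
    }
    return @urls;
}

my $samp = '
 &lt;a href="http://www.somesite.com/" target="_blank"&gt;
 <a target="_blank" href="http://www.someothersite.com/" &gt;
';
print "$_\n" for extract_hrefs($samp);
```

Because the match runs on the normalized copy, any offsets or whole-tag captures refer to that copy and not to the original document. If the matched text has to be located in place, the single-pattern version above is the safer choice.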