Html: Regular expression for extracting tag attributes

I'm trying to extract the attributes of an anchor tag (<a>). So far, I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+

(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This works for strings such as

<a href="test.html" class="xyz">

and (with single quotes)

<a href='test.html' class="xyz">

but not for strings without quotes:

<a href=test.html class=xyz>

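How can I modify my regex so that it also works with unquoted attributes? Or is there a better way to do this?

Update: Thanks for all the good comments and suggestions so far. There is one thing I didn't mention: unfortunately, I have to patch/modify code that I didn't write, and there is no time or money to rewrite it from scratch.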

I'd suggest that you use a tool to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
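To make the XHTML-plus-XPath suggestion concrete, here is a minimal Python sketch. It assumes the markup has already been converted to well-formed XHTML (the conversion step is not shown), and it relies on the limited XPath subset in Python's `xml.etree.ElementTree`. The sample markup and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed input: markup that has already been converted to well-formed XHTML.
xhtml = (
    "<div>"
    '<a href="test.html" class="xyz">first</a>'
    "<a href='other.html'>second</a>"
    "</div>"
)

root = ET.fromstring(xhtml)

# ".//a" selects every <a> descendant; each element exposes its attributes
# as a dictionary through the .attrib property.
anchor_attributes = [dict(a.attrib) for a in root.findall(".//a")]
print(anchor_attributes)
```

Because the XML parser has already dealt with quoting, no attribute-level regular expression is needed at this point.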

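If you want to be general, you have to look at the precise specification of the tag. But even then, if you write the perfect regex, what happens when you get malformed HTML?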

I would suggest using a library to parse the HTML, depending on the language you're working with: for example, Python's Beautiful Soup.
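As a rough illustration of the parser-based route, the sketch below uses `html.parser` from Python's standard library as a stand-in for a library such as Beautiful Soup. The class name `AnchorAttributeCollector` and the helper `anchor_attributes` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class AnchorAttributeCollector(HTMLParser):
    """Collects the attributes of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed any surrounding quotes from the values.
        if tag == "a":
            self.anchors.append(dict(attrs))


def anchor_attributes(html):
    collector = AnchorAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors


print(anchor_attributes("<a href=test.html class=xyz>"))
print(anchor_attributes("<a href='test.html' class=\"xyz\">"))
```

The unquoted form that trips up the regular expression in the question needs no special handling here.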

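Token mantra response: you should not tweak, modify, harvest, or otherwise produce HTML/XML using regular expressions.

There are too many corner-case conditions (different kinds of quoting, for example) that must be accounted for. You are much better off using a proper DOM parser, an XML parser, or one of the dozens of other tried and tested tools for this job instead of inventing your own.

I really don't care which one you use, as long as it's recognized, tested, and you use one.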

my $foo  = Someclass->parse( $xmlstring ); 
my @links = $foo->getChildrenByTagName("a"); 
my @srcs = map { $_->getAttribute("src") } @links; 
# @srcs now contains an array of src attributes extracted from the page.

If you're in .NET, I recommend the HTML Agility Pack; it is very robust, even with malformed HTML.

Then you can use XPath.

Update (2020): an alternative has been suggested (note that regex101.com did not exist when I originally wrote this answer).

Original answer (2008): if you have an element like

<name attribute=value attribute="value" attribute='value'>

Applies to:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>

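Note: this does not work with numeric attribute values.

Edit: improved the regex so that it also picks up attributes with no value and values that contain a ' character.

Applies to: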

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>

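I would reconsider the strategy of using only a single regular expression. Sure, it's a nice game to come up with one regex that does it all, but in terms of maintainability you are about to shoot yourself in the foot.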
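One way to read this advice is to split the job into two simpler patterns: one that isolates each opening anchor tag, and one that walks the attributes inside it. The Python sketch below is only an illustration of that split under stated assumptions. Both patterns and the function name `anchor_attributes` are made up for this example, attributes without a value are ignored, and malformed HTML can still defeat it.

```python
import re

# Step 1: find each opening <a ...> tag. Quoted values may contain '>'.
ANCHOR_TAG = re.compile(r"""<a\b((?:"[^"]*"|'[^']*'|[^'">])*)>""", re.IGNORECASE)

# Step 2: name=value pairs inside the tag body captured by step 1.
# Groups 2, 3 and 4 hold a double-quoted, single-quoted or unquoted value.
ATTRIBUTE = re.compile(r"""(\w[\w-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+))""")


def anchor_attributes(html):
    results = []
    for tag in ANCHOR_TAG.finditer(html):
        attrs = {}
        for m in ATTRIBUTE.finditer(tag.group(1)):
            # Exactly one of the three value groups took part in the match.
            attrs[m.group(1)] = next(g for g in m.group(2, 3, 4) if g is not None)
        results.append(attrs)
    return results


sample = """<p><a href=test.html class=xyz>x</a> <a href="a>b" title='t'>y</a></p>"""
print(anchor_attributes(sample))
```

Each pattern stays short enough to read and test on its own, which is the maintainability point being made.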

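Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes even from a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries that either read broken HTML or correct it into valid XHTML, which you can then easily process with an XML parser. Use them.

Although the advice not to parse HTML via regexp is valid, here is an expression that does pretty much what you asked for: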

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # but look ahead to see that the rest of the expression
                          #    does not match.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture the opening quote, single or double
            (.*?)         # the attribute value, matched non-greedily
             \2           # whichever quote character was captured above
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

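As a side note, if you want to run a substitution under Perl 5.10 (and, I think, PCRE), you can put `\K` right before the attribute name and not have to worry about capturing everything you want to skip over.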

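You can't use the same name for more than one capture, so you can't put a quantifier on an expression that contains named captures.

One way out is to not use named captures: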

(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

The downside is that you have to strip the leading and trailing quotes afterwards.
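The quote-stripping step can be sketched in Python as follows. Python's `re` module also rejects a pattern that defines the same group name twice, so the sketch uses plain numbered groups and matches one attribute at a time with `findall` instead of a quantifier. The helper names `strip_quotes` and `extract` are hypothetical.

```python
import re

# One attribute per match: group 1 is the name, group 2 the raw value,
# which still carries its surrounding quotes when it was quoted.
ATTRIBUTE = re.compile(r"""(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)""")


def strip_quotes(raw):
    # Remove one matching pair of surrounding quotes, if present.
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
        return raw[1:-1]
    return raw


def extract(tag):
    return [(name, strip_quotes(raw)) for name, raw in ATTRIBUTE.findall(tag)]


print(extract("<a href=test.html class=xyz>"))
print(extract("<a href='test.html' class=\"xyz\">"))
```

Matching attribute by attribute avoids the repeated group altogether, at the cost of no longer checking that the attributes sit inside a single tag.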

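Something like this might help:

'(\S+)\s*?=\s*([\'"])(.*?|)\2'

Extract the element:

var buttonMatcherRegExp = /<a[\s\S]*?>[\s\S]*?<\/a>/;
var htmlStr = string.match(buttonMatcherRegExp)[0];

Then use jQuery to parse the fragment and read the attribute you want:

$(htmlStr).attr('style')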

I needed this too and wrote a function that parses attributes. (Note: it does not use regular expressions.)

@VonC's solution partly works, but it runs into problems when a tag mixes unquoted and quoted attributes.

This one works with mixed attributes:

$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";

To test it:

<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";

$code = '    <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
    ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);
var_dump( $ms );

$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/>      ';

preg_match_all( "@$pat_attributes@isU", $code, $ms);

var_dump( $ms );

// Group 1 holds the attribute names and group 3 the values
// (group 2 is the opening quote, if there is one).
$keys = $ms[1];
$values = $ms[3];

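I have created a function that can extract the attributes of any HTML tag. It also handles attributes that have no value, such as `disabled`, and it can determine whether the tag is a stand-alone tag (one with no closing tag) by checking the `content` result: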

/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
    if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
    $matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
    $results = array(
        'element' => $matches[2],
        'attributes' => null,
        'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
    );
    if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
        $results['attributes'] = array();
        foreach($attrs[1] as $i => $attr) {
            $results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
        }
    }
    return $results;
}

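Test code:

$test = array(
    '<div class="foo" id="bar" data-test="1000">',
    '<div>',
    '<div class="foo" id="bar" data-test="1000">test content</div>',
    '<div>test content</div>',
    '<div>test content</span>',
    '<div>test content',
    '<div></div>',
    '<div class="foo" id="bar" data-test="1000"/>',
    '<div class="foo" id="bar" data-test="1000" />',
    '< div  class="foo"     id="bar"   data-test="1000"       />',
    '<div class id data-test>',
    '<id="foo" data-test="1000">',
    '<id data-test>',
    '<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);

foreach($test as $t) {
    var_dump($t, extract_html_attributes($t));
    echo '<hr>';
}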

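Take a look at this.

Maybe you can walk the DOM and read the attributes you need. It works fine for me for getting attributes from the body tag.

Simple attribute extraction in PHP (PCRE) and Python:

This works well for me. It also takes into account some edge cases I have run into.

I use this regex as an XML parser: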

(?<=\s)[^><:\s]*=*(?=[>,\s])

((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))

(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
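To get a feel for what the short pattern on the line above captures, here is a small Python sketch that applies it to the question's sample tags. The name `simple_attributes` is hypothetical, and the backslashes before `/` and `'` are dropped because Python does not need them.

```python
import re

# Group 1: attribute name. Group 2: the value up to the first quote,
# whitespace, '>' or '/>'. An opening quote, if present, is skipped.
SIMPLE_ATTRIBUTE = re.compile(r"""(\S+)=['"]?((?:(?!/>|>|"|'|\s).)+)""")


def simple_attributes(tag):
    return SIMPLE_ATTRIBUTE.findall(tag)


print(simple_attributes("<a href=test.html class=xyz>"))
print(simple_attributes("<a href='test.html' class=\"xyz\">"))
```

Because the value stops at the first whitespace character, a quoted value that contains a space, such as `title="a b"`, comes back truncated to `a`. That seems to be the price of keeping the pattern this short.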

(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2

(\S+)\s*=\s*([']|["])([\W\w]*?)\2

<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns 
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">

<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>

<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>

Match:  <div '>

<div(?:\".*?\"|'.*?'|.*?)*?>

<div id="a"> # returns "a instead of a
<div style=""> # does not match at all, instead of returning an empty value
<div title = "c"> # does not handle the spaces around the equals sign (=)

(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

<tag 
   attrnovalue 
   attrnoquote=bli 
   attrdoublequote="blah 'blah'"
   attrsinglequote='bloob "bloob"' >

attr(?=(attr)*\s*/?\s*>)

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?

\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)
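As a quick check of the shorter of the two patterns above (the one without the trailing lookahead), the following Python sketch runs it over the four-attribute sample tag. The function name `attributes` is made up for this example, and the sketch assumes the input is a single opening tag, since without the lookahead nothing verifies that a match sits inside a tag.

```python
import re

# Group 1: name. Groups 2, 3, 4: double-quoted, single-quoted, unquoted value.
ATTRIBUTE = re.compile(
    r"""\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?"""
)

TAG = """<tag
   attrnovalue
   attrnoquote=bli
   attrdoublequote="blah 'blah'"
   attrsinglequote='bloob "bloob"' >"""


def attributes(tag):
    result = {}
    for m in ATTRIBUTE.finditer(tag):
        # At most one value group is set; none is set for a bare attribute.
        result[m.group(1)] = next(
            (g for g in m.group(2, 3, 4) if g is not None), None
        )
    return result


print(attributes(TAG))
```

A bare attribute is reported with the value `None`, which keeps it distinguishable from an attribute whose value is the empty string.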