Regex: matching only tag names in HTML


How can I use a regular expression to retrieve all the HTML tag names in an HTML snippet? If it matters, I'm doing this in PHP. For example:

<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>





should return: div, img, br, p.

I think this should work... I'll try it out in a moment:

EDIT: removed \s+ (thanks, Peteris)

preg_match_all('/<(\w+)[^>]*>/', $html, $matched_elements);

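A regular expression may not always work. If you are 100% sure the input is well-formed XHTML, then a regex may be one way to do this. If not, use a PHP library for the job. In C#, for example, there is something called the HTML Agility Pack. There is probably an equivalent tool for PHP.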
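As an illustration of the parser-based route, here is a minimal sketch in Python rather than PHP, using the standard library's html.parser module. The class name TagNameCollector and the sample string are assumptions made for this example; the answer above does not name a specific parser.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Collect start-tag names in document order, without duplicates."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br />, because the
        # default handle_startendtag() delegates to handle_starttag().
        if tag not in self.names:
            self.names.append(tag)


html = """<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>"""

collector = TagNameCollector()
collector.feed(html)
collector.close()
print(collector.names)
```

The parser lowercases tag names and ignores closing tags here, so no pattern has to cope with attribute quoting at all.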

This should work for most well-formed markup, provided you are not inside a CDATA section and nobody has played nasty games redefining entities:

# nasty, ugly, illegible, unmaintainable -- NEVER USE THIS STYLE!!!!
/<\w+(?:\s+\w+=(?:\S+|(['"])(?:(?!\1).)*?\1))*\s*\/?>/s

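Yes, it got longer, but by getting longer it became more maintainable, not less. It is also more correct. The real program this comes from does a good deal more than this, because you have to consider much more than what is in the actual HTML, such as CDATA, encodings, and naughty redefinitions of entities. Contrary to popular belief, though, you really can do this sort of thing in PHP, because it uses PCRE, which supports (?(DEFINE)...) blocks and recursive patterns. I have more serious examples in my other answers.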

OK, fine, did you read all of those, or at least glance at them? Still with me? Hello? Don't forget to breathe. There, you're all right now. :)

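Of course, there is a big gray area where the possible gives way to the inadvisable, and it does so much faster than it gives way to the impossible. If the examples in those answers, let alone the ones in this answer, are beyond your current level of pattern-matching skill, then you should probably use some other approach, which usually means getting somebody else to do it for you.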

In Python, one solution is to use a regular expression to collect all the distinct tag names in the HTML.

It doesn't work on <p>. The fix is '/<(\w+)(>|\s+[^>]*>)/'
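@CanSpice: So what? Don't make me show you how to do it! Besides, do we know anything about the data? No. You may well know this data, and it may not be open-ended at all.

@tchrist: So he should use an HTML parser to parse HTML. He should use the right tool for the job.

@CanSpice: I'm not prepared to say that. I've told you that I use search and replace in vi when editing HTML. If that is allowed, then surely you should be allowed to use pattern matching on HTML. If it is not, then you shouldn't be allowed to use vi on those files. I admit that HTML is text, complicated text, but still just text. There is nothing wrong with writing ://, ///s// in vi, so there is nothing wrong with writing the equivalent in the programming language of your choice. Stop forcing non-text solutions on newbies. I use things like /color="#000000" and :g/

@tchrist, indeed, if you have well-formed XHTML, the one-off searches you mention above are probably not a problem. But to get a more robust solution, I decided to use PHP's DOMDocument class as the parser.

I hadn't come across (?(DEFINE)...) before. Do you know whether only Perl and PCRE support it, or are there other implementations that do? (I got nothing useful out of Google.)

@Peter: Yes, these things are impossible to google, because non-alphanumerics get discarded and case is ignored. I wasn't paying attention when it happened, but my hunch is that Perl got it from PCRE rather than the other way around. I don't know of anything else that supports it. It has plenty of problems (if you look at my programs, I am forced to duplicate the contents of common subroutine definitions because there is no namespace control), but it is still very cool.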
# broken out into related elements grouped by whitespace via /x
/ < \w+ (?: \s+ \w+ = (?: \S+ | (['"]) (?: (?! \1) . ) *? \1 )) * \s* \/? > /xs

/ 
   # start of tag, with named ident
   < \w+ 
   # now with unlimited k=v pairs 
   #    where k is \w+ 
   #      and v is either \S+ or else quoted 
   (?: \s+ \w+ = (?: \S+        # either an unquoted value, 
                   | ( ['"] )   # or else first pick either quote
                     (?: 
                        (?! \1) .  # anything that isn't our quote, including brackets
                     ) * ?     # maximal should probably work here
                     \1        # till we see it again
                 ) 
   )  *    # as many k=v pairs as we can find
   \s *    # tolerate closing whitespace

   \/ ?    # XHTML style close tag
   >       # finally done
/xs
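For readers following along in Python rather than Perl, the commented pattern above carries over almost verbatim with re.VERBOSE and re.DOTALL, the counterparts of /x and /s. The sketch below is an adaptation, not part of the original answer: it adds a capture group around the tag name, which shifts the quote backreference from \1 to \2, and TAG_RX and the sample string are names chosen here for illustration.

```python
import re

TAG_RX = re.compile(r"""
    < (\w+)                   # start of tag; group 1 captures the name
    (?: \s+ \w+ =             # any number of k=v attribute pairs
        (?: \S+               # either an unquoted value,
          | (['"])            # or an opening quote (group 2),
            (?: (?!\2) . )*?  # then anything that is not that quote,
            \2                # up to the matching closing quote
        )
    )*
    \s*                       # tolerate closing whitespace
    /?                        # optional XHTML-style self-close
    >
""", re.VERBOSE | re.DOTALL)

html = """<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>"""

# finditer is used instead of findall so that only group 1 is collected.
names = [m.group(1) for m in TAG_RX.finditer(html)]
print(names)
```

One Python-specific detail: under re.VERBOSE, whitespace is not allowed inside a token such as *?, so the quantifier that Perl accepts as "* ?" has to be written without the space.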

my $img_rx = qr{

    # save capture in $+{TAG} variable
    (?<TAG> (?&image_tag) )

    # remainder is pure declaration
    (?(DEFINE)

        (?<image_tag>
            (?&start_tag)
            (?&might_white) 
            (?&attributes) 
            (?&might_white) 
            (?&end_tag)
        )

        (?<attributes>
            (?: 
                (?&might_white) 
                (?&one_attribute) 
            ) *
        )

        (?<one_attribute>
            \b
            (?&legal_attribute)
            (?&might_white) = (?&might_white) 
            (?:
                (?&quoted_value)
              | (?&unquoted_value)
            )
        )

        (?<legal_attribute> 
            (?: (?&required_attribute)
              | (?&optional_attribute)
              | (?&standard_attribute)
              | (?&event_attribute)
              # for LEGAL parse only, comment out next line 
              | (?&illegal_attribute)
            )
        )

        (?<illegal_attribute> \b \w+ \b )

        (?<required_attribute>
            alt
          | src
        )

        (?<optional_attribute>
            (?&permitted_attribute)
          | (?&deprecated_attribute)
        )

        # NB: The white space in string literals 
        #     below DOES NOT COUNT!   It's just 
        #     there for legibility.

        (?<permitted_attribute>
            height
          | is map
          | long desc
          | use map
          | width
        )

        (?<deprecated_attribute>
             align
           | border
           | hspace
           | vspace
        )

        (?<standard_attribute>
            class
          | dir
          | id
          | style
          | title
          | xml:lang
        )

        (?<event_attribute>
            on abort
          | on click
          | on dbl click
          | on mouse down
          | on mouse out
          | on key down
          | on key press
          | on key up
        )

        (?<unquoted_value> 
            (?&unwhite_chunk) 
        )

        (?<quoted_value>
            (?<quote>   ["']      )
            (?: (?! \k<quote> ) . ) *
            \k<quote> 
        )

        (?<unwhite_chunk>   
            (?:
                # (?! [<>'"] ) 
                (?! > ) 
                \S
            ) +   
        )

        (?<might_white>     \s *   )

        (?<start_tag>  
            < (?&might_white) 
            img 
            \b       
        )

        (?<end_tag>          
            (?&html_end_tag)
          | (?&xhtml_end_tag)
        )

        (?<html_end_tag>       >  )
        (?<xhtml_end_tag>    / >  )

    )

}six;
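Python's standard re module has no (?(DEFINE)...) blocks, so the grammar-style layout above cannot be copied directly. A rough approximation of the same idea, assuming plain string composition is an acceptable substitute, is to build the pattern from named pieces. The constant names below mirror a few of the Perl subpattern names, the attribute-name check is simplified to \w+, and IMG_RX is a name chosen for this sketch.

```python
import re

# Small named pieces, loosely mirroring the Perl subpatterns of the same name.
MIGHT_WHITE = r"\s*"
QUOTED_VALUE = r"""(?P<quote>["'])(?:(?!(?P=quote)).)*(?P=quote)"""
UNQUOTED_VALUE = r"[^\s>]+"  # non-whitespace characters other than '>'
ONE_ATTRIBUTE = (
    rf"\b\w+{MIGHT_WHITE}={MIGHT_WHITE}"
    rf"(?:{QUOTED_VALUE}|{UNQUOTED_VALUE})"
)
ATTRIBUTES = rf"(?:{MIGHT_WHITE}{ONE_ATTRIBUTE})*"
START_TAG = rf"<{MIGHT_WHITE}img\b"
END_TAG = r"/?>"  # HTML-style '>' or XHTML-style '/>'

# The whole tag is captured in the named group TAG, as in the Perl version.
IMG_RX = re.compile(
    rf"(?P<TAG>{START_TAG}{MIGHT_WHITE}{ATTRIBUTES}{MIGHT_WHITE}{END_TAG})",
    re.DOTALL | re.IGNORECASE,
)

html = """<div id="someid">
     <img src="someurl" />
     <br />
     <p>some content</p>
</div>"""

tags = [m.group("TAG") for m in IMG_RX.finditer(html)]
print(tags)
```

Unlike real DEFINE blocks, each piece is pasted in textually, so a named group such as quote can appear only once in the final pattern, and recursion is not available this way.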

import re

s = """<div id="someid">
       <img src="someurl" />
       <br />
       <p>some content</p>
       </div>
    """

print(set(re.findall(r'<(\w+)', s)))
# {'p', 'img', 'div', 'br'}  (set order may vary)
# or, equivalently:
print({i.replace('<', '') for i in re.findall(r'(<\w+)', s)})
# {'p', 'img', 'div', 'br'}  (set order may vary)
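Because a set has no defined order, the output above will not necessarily come back as div, img, br, p. If the order of first appearance matters, one possible variation is to deduplicate with dict.fromkeys instead. The helper name tag_names is made up for this sketch, and lowercasing the names is an added assumption.

```python
import re


def tag_names(html):
    """Return distinct start-tag names in order of first appearance."""
    # '<(\w+)' skips closing tags, because '/' is not a word character.
    names = re.findall(r'<(\w+)', html)
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(name.lower() for name in names))


s = """<div id="someid">
       <img src="someurl" />
       <br />
       <p>some content</p>
       </div>
    """

print(tag_names(s))
```

Like the one-liners above, this still matches anything that merely looks like a start tag, including text inside comments or scripts.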