Listing certain HTML tags in Python?

python regex

Say, for example:

import re

allowed_bits = ['a', 'p']
re.compile(r'<(%s)[^>]*(/>|.*?</\1>)' % ('|'.join(allowed_bits)))

matches:

<a href="blah blah">blah</a>
<p />

and not:

<html>blah blah blah</html>

What I want to do is turn it on its head, so that it matches

<html>blah blah</html>
<script type="text/javascript">blah blah</script>

and not:

<a href="blah blah">blah</a>
<p>Hello</p>

My theory is to do it like this:

re.compile(r'<(^%s)[^>]*(/>|.*?</\1>)' % ('|'.join(allowed_bits)))

but that doesn't work.

Any ideas? I'd like to do an inverse match.
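For reference, here is a small sketch of why the `^` attempt fails. In Python's `re`, `^` only means negation as the first character inside a character class such as `[^>]`. Inside a group it is a start-of-string anchor, so `(^a|p)` reads as "start of string followed by `a`, or `p`". The sample strings are the ones from the question. The variable names `attempt` and `samples` are illustrative only.

```python
import re

allowed_bits = ['a', 'p']

# The attempted pattern: '^' at the start of a capturing group.
# Here '^' is an anchor, not a negation, so '^a' can never match
# right after '<' and only the 'p' alternative is usable.
attempt = re.compile(r'<(^%s)[^>]*(/>|.*?</\1>)' % '|'.join(allowed_bits))

samples = [
    '<html>blah blah blah</html>',
    '<script type="text/javascript">blah blah</script>',
    '<a href="blah blah">blah</a>',
    '<p />',
]

results = [bool(attempt.search(s)) for s in samples]
for s, matched in zip(samples, results):
    print(s, '->', matched)
```

Instead of inverting the match, the pattern still matches only `<p />` and misses both the `<html>` and `<script>` examples.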

Use a negative lookahead, along these lines:

re.compile(r'<(?!(?:%s)\b)(\w+)[^>]*(/>|.*?</\1>)' % ('|'.join(allowed_bits)))
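A minimal runnable sketch of the inverse match, assuming a negative lookahead `(?!...)` is the intended technique. The exact pattern is a reconstruction, not necessarily the one originally posted. The `\b` keeps a tag such as `<pre>` from being treated as the allowed `<p>`, the added `(\w+)` group gives `\1` a tag name to refer back to, and `re.DOTALL` and `re.escape` are extra assumptions so that multi-line elements and unusual names behave sensibly.

```python
import re

allowed_bits = ['a', 'p']

# '(?!(?:a|p)\b)' rejects a tag whose name is exactly one of the allowed
# names; '(\w+)' then captures the actual tag name for the '\1' backreference.
disallowed = re.compile(
    r'<(?!(?:%s)\b)(\w+)[^>]*(/>|.*?</\1>)'
    % '|'.join(map(re.escape, allowed_bits)),
    re.DOTALL,
)

should_match = [
    '<html>blah blah blah</html>',
    '<script type="text/javascript">blah blah</script>',
    '<pre>blah</pre>',
]
should_not_match = [
    '<a href="blah blah">blah</a>',
    '<p />',
    '<p>Hello</p>',
]

for s in should_match + should_not_match:
    print(s, '->', bool(disallowed.search(s)))
```

Like the original pattern, this only checks the outermost tag it finds and does not understand nesting, comments, or malformed markup.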

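Don't use regular expressions to parse [X][HT]ML. It will never work reliably. In particular, don't use a regex to filter HTML tags as a security measure. Use a proper XML or HTML parser (such as BeautifulSoup).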
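As a rough illustration of the parser-based approach, the sketch below uses the standard library's `html.parser` rather than BeautifulSoup, purely to stay dependency-free. That substitution is an assumption of this example, and the names `DisallowedTagFinder`, `find_disallowed`, and `ALLOWED_TAGS` are hypothetical. It reports every tag that is not on the whitelist instead of trying to match whole elements with a pattern.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {'a', 'p'}  # assumed whitelist, mirroring allowed_bits


class DisallowedTagFinder(HTMLParser):
    """Collects the names of tags that are not in the whitelist."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = set(allowed)
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # Called for '<tag ...>' and, through the default
        # handle_startendtag, for self-closing '<tag />' as well.
        # The parser lowercases tag names before calling this.
        if tag not in self.allowed:
            self.disallowed.append(tag)


def find_disallowed(html, allowed=ALLOWED_TAGS):
    """Return the non-whitelisted tag names in document order."""
    finder = DisallowedTagFinder(allowed)
    finder.feed(html)
    finder.close()
    return finder.disallowed


doc = ('<html><p>Hello</p>'
       '<script type="text/javascript">blah blah</script>'
       '<a href="blah blah">blah</a></html>')
print(find_disallowed(doc))
```

For actual sanitising of untrusted input, a maintained library built for that purpose is a safer choice than a hand-rolled filter like this one.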