Python regular expression for multiple tags

python html regex

I would like to know how to retrieve all the results from each <p> tag.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups()

What I need is:

('item1', 'item2', 'item3')

You can use re.findall like this:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print re.findall('<p[^>]*size="[0-9]">(.*?)</p>', html)
# This prints: ['item1', 'item2', 'item3']
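To make the difference explicit, here is a minimal sketch (written for Python 3, with variable names chosen only for illustration) comparing the two calls on the question's input. re.match only tries to match at the start of the string and returns a single match object, so .groups() holds just the first paragraph; re.findall scans the whole string and returns every captured group.

```python
import re

html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match() anchors at position 0 and stops after the first match.
first_only = pattern.match(html_text).groups()

# findall() returns the captured group of every non-overlapping match.
all_items = tuple(pattern.findall(html_text))

print(first_only)
print(all_items)
```

Wrapping the findall result in tuple() is only there to reproduce the exact shape asked for in the question.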


Edit: ...but as many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.

For this type of problem, a DOM parser is recommended rather than a regular expression.

I have frequently seen one recommended for Python.

Alternatively, xml.dom.minidom can parse the input if

...it is well-formed
...you embed it in a single root element

For example:

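>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')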
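The two conditions on the minidom approach matter in practice. The sketch below (Python 3; the helper name paragraph_texts and the root element name are assumptions made for illustration) wraps a fragment in a single root element and shows that a well-formed fragment parses, while ordinary HTML with an unclosed tag raises an error instead of returning partial results.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def paragraph_texts(fragment):
    """Return the direct text of each <p> in a well-formed fragment."""
    # minidom needs exactly one root element, so wrap the fragment.
    doc = xml.dom.minidom.parseString('<root>%s</root>' % fragment)
    texts = []
    for p in doc.getElementsByTagName('p'):
        # Join only the text nodes that are direct children of this <p>.
        texts.append(''.join(
            node.data for node in p.childNodes
            if node.nodeType == node.TEXT_NODE))
    return texts


well_formed = '<p data="5" size="4">item1</p><p size="4">item2</p>'
not_well_formed = '<p>item1<br></p>'  # <br> is never closed

print(paragraph_texts(well_formed))

try:
    paragraph_texts(not_well_formed)
    parse_failed = False
except ExpatError:
    # The XML parser rejects input that is not well-formed.
    parse_failed = True

print(parse_failed)
```

So this route only suits input you control or know to be well-formed.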

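Beautiful Soup is definitely the way to go for this kind of problem. The code is cleaner and easier to read. Once you have it installed, getting all the tags looks something like this: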

from BeautifulSoup import BeautifulSoup
import urllib2

def getTags(tag):
  f = urllib2.urlopen("http://cnn.com")
  soup = BeautifulSoup(f.read())
  return soup.findAll(tag)


if __name__ == '__main__':
  tags = getTags('p')
  for tag in tags: print(tag.contents)

This prints the contents of every p tag.

The regex answer is extremely fragile. Here is a proof (and a working BeautifulSoup example):

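from BeautifulSoup import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html).findAll('p')
items = [''.join(p.findAll(text=True)) for p in paragraphs]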
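To see the fragility concretely, the pattern from the regex answer can be run against the original input and the three variants (attribute order swapped, a space before the closing bracket, a two-digit size). This is a small Python 3 sketch using only the standard library; the dictionary names are just for illustration.

```python
import re

# Pattern taken from the regex answer.
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    'html':  '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    'html2': '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    'html3': '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    'html4': '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

# Run the same pattern over every sample and collect what it captures.
results = {name: pattern.findall(text) for name, text in samples.items()}

for name, found in results.items():
    print(name, found)
```

Only the first sample yields all three items; in each of the other three the first paragraph is silently skipped, because the pattern requires size="<one digit>" to sit immediately before the closing bracket.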

Use BeautifulSoup.

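Thanks! I just found it in the Python docs!

Sorry, this is a poor answer. What if there is a space between the size attribute and the closing bracket?

@Triptych: No. Have you considered the possibility that the OP knows what he is doing? 8-) If the question were "How do I parse this HTML?" then I would not suggest a regex. But it is "How do I make my regex work?", and this is the answer to that question.

-1: gives an example of parsing HTML with a regex without even saying that this is really bad, and a lot of novices will read it. Evil comes from behavior like this.

@RichieHindle: The original poster did not mention making a regex work. He said he wants to retrieve the results from each p tag. Regular expressions are not suited to that.

-1 for trying to parse a non-regular language with regular expressions.

Agreed. Isn't there a Python library that is well known for parsing HTML? Beautiful Soup? htmllib?

Thanks for your reply. I need a Python way to print all the values of the p tags from a small piece of HTML, without installing anything new on the server. Also, I'm curious what your example provides that mine doesn't, other than the list comprehension.

Brett: mine correctly handles cases like item1, whereas yours would fail. Also, items here ends up as a list of strings, whereas your example returns tag.contents, which is actually a (very memory-hungry) BeautifulSoup object.

Cool! I didn't know that object was memory intensive; I have only used it in small parsing projects and never ran into problems. Thanks for the update. Based on your explanation, I upvoted your post.

I have processed some very large (500KB+) HTML files with BeautifulSoup, and if you don't learn to conserve memory you hit a fairly hard wall. BeautifulSoup is very convenient, but it is not efficient.

Thanks for your reply. I just need a Python way to print all the values of the p tags, without installing anything new on the server.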
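For the "nothing new installed on the server" constraint, one option not mentioned in the thread is the standard library's html.parser module (Python 3). The sketch below is an assumption-laden illustration, not anyone's posted answer: the class ParagraphTextParser and the function paragraph_texts are hypothetical names, and it does not try to handle paragraphs nested inside paragraphs.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect the text found inside each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished paragraph texts
        self._in_p = False    # True while between <p> and </p>
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self.items.append(''.join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self._in_p:
            self._chunks.append(data)


def paragraph_texts(html):
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


sample = '<p data="5" size="4" >item1</p><p size="12"><b>item2</b></p><p>item3</p>'
print(paragraph_texts(sample))
```

Because the parser looks at tag names rather than the exact attribute text, the attribute order, extra spaces and multi-digit values that trip up the regex do not affect it.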


