Python: How can I use Beautiful Soup and re to find spans of a specific class that contain specific text?


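How do I find all spans of class 'blue' that contain text in the following format:

04/18/13 7:29pm

A matching span could therefore contain just the timestamp:

04/18/13 7:29pm

or the timestamp together with other text.

In terms of building the logic to do this, here is what I have so far: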

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

This gives the error:

'TypeError: expected string or buffer'

This pattern seems to do what you need:

>>> pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)


Here is a flexible regular expression you can use:

"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])"

For example:

>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html)
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
      for line in lines
        for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[a|p|A|P][m|M])', line),)
          if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
    print i

04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
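
The pattern above is deliberately permissive, so it also accepts strings that are not real timestamps, such as 15/18/2013 10:00AM (month 15) and 4/1/2013 17:09aM (hour 17 with an am marker). If that matters, one possible follow-up step is to pass each match through datetime.strptime and keep only the values that parse. This is a sketch under assumptions the answer does not state: dates are month/day/year, years have two or four digits, the locale uses English am/pm markers, and the name parse_timestamp is made up for illustration. It also writes the am/pm classes as [apAP][mM], because inside brackets the | in [a|p|A|P] is a literal character rather than alternation.

```python
import re
from datetime import datetime

# Same permissive shape as the answer's regex, split into date and time groups.
TIMESTAMP = re.compile(r'(\d\d?/\d\d?/\d\d(?:\d\d)?)\s*(\d\d?:\d\d[apAP][mM])')


def parse_timestamp(text):
    """Return a datetime for the first timestamp-like match in text, or None."""
    m = TIMESTAMP.search(text)
    if m is None:
        return None
    date_part, time_part = m.groups()
    # Assumption: month/day/year order; try 2-digit years, then 4-digit years.
    for fmt in ('%m/%d/%y %I:%M%p', '%m/%d/%Y %I:%M%p'):
        try:
            return datetime.strptime(date_part + ' ' + time_part, fmt)
        except ValueError:
            continue
    # The text looked like a timestamp but is not a valid date/time.
    return None


for line in ('Posted on 04/20/13 10:31pm', 'Posted on 15/18/2013 10:00AM'):
    print(line, '->', parse_timestamp(line))
```

Returning None for both "no match" and "matched but invalid" keeps the caller's check to a single if, at the cost of not distinguishing the two cases.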


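I'm not sure how to implement this. I've posted the code I tried, based on your suggestion, in the original post (see Edit 2).

@user1063287 Try changing the third line to result = pattern.match(_).groups(). re.findall expects a string (like the one you used when calling re.compile earlier), but you are giving it an already-compiled regular expression; in effect you are compiling the pattern twice. It also sounds like _ is not a string yet: you need to extract the actual string from the _ variable before you can use a regex on it. I assume you can call something like .string. Try a few print statements, such as print _ and print dir(_), to figure out what type of object you are working with.

@user1063287 Corey's answer gives you a more thorough explanation. The method you need to call on _ is get_text(), but he provides a more complete answer :) The AttributeError you are getting occurs when the regex does not match the string, so match returns None. Your code then calls None.groups(), which does not exist. Corey's code accounts for this with his line if m:, which is why I pointed you to his code. Hope this helps!

I can run the exact code above successfully, but it did not work in my implementation. I think that was because, in the original source, the date and time are separated by something other than a plain space: 04/18/13 7:29pm. For reference, I added a .replace() call to the original 'urlopen read object', and it works well. Many thanks (to all responders!).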
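
The two failure modes discussed in these comments, a match that returns None and a separator that is not a plain space, can both be handled in one small helper. Note that re.findall does accept a compiled pattern as its first argument, so the TypeError in the question comes from passing a Beautiful Soup element where a string is expected. The following is a sketch, not the code the asker ended up with: it assumes the separator is a non-breaking space, either as the literal &nbsp; entity in raw HTML or as the decoded character U+00A0 (the comment does not say which character it was), and the names TIMESTAMP and extract_timestamp are made up for illustration.

```python
import re

TIMESTAMP = re.compile(r'\d\d/\d\d/\d\d \d\d?:\d\d[ap]m')


def extract_timestamp(text):
    """Return the first timestamp found in text, or None if there is none."""
    # Assumption: the odd separator is a non-breaking space, seen either as
    # the raw HTML entity or as the already-decoded character.
    cleaned = text.replace('&nbsp;', ' ').replace('\u00a0', ' ')
    m = TIMESTAMP.search(cleaned)
    # search() returns None when nothing matches, so check before group().
    return m.group(0) if m else None


print(extract_timestamp('Posted on 04/18/13&nbsp;7:29pm'))
print(extract_timestamp('here is a lot of text that i dont need'))
```

With Beautiful Soup, the text argument would be the result of get_text() on each span rather than the span object itself.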

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html; naming the parser avoids bs4's parser-guessing warning
soup = BeautifulSoup(html_doc, 'html.parser')

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    # [ap] matches 'a' or 'p'; a '|' inside brackets would be a literal character
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)

04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
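
As a possible shortcut, Beautiful Soup can apply the regular expression while searching, so the class lookup and the text filter happen in one call. This sketch assumes bs4 4.4.0 or later (where the keyword is string; older releases call it text) and the built-in html.parser. It is an alternative to the loop above, not what either answer used, and the red span is added here only to show that the class filter still applies. One caveat: the string filter is checked against each tag's .string, which is None when a span contains nested tags, so such spans would be skipped.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="red">04/21/13 9:00am</span>
<span class="blue">04/19/13 7:30pm</span>
"""

timestamp = re.compile(r'\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')

# Keep only spans of class "blue" whose text contains a timestamp.
matching_spans = soup.find_all('span', class_='blue', string=timestamp)

# Pull the timestamp itself out of each matching span.
found_dates = [timestamp.search(span.get_text()).group(0)
               for span in matching_spans]

for date in found_dates:
    print(date)
```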