How to extract text with Python, including the links, the text after each link, and the other text after a br


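I have parsed the string below to extract data from it, but I cannot extract part of the data. I have tried several approaches. I managed to get the text between the tags, the links, and the text outside each link.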

<html>
 <body>
  <p align="left">
   <font face="Arial, Helvetica, sans-serif" size="2">
    <b>
     <font size="4">
      GOVERNOR:
     </font>
    </b>
    <br/>
   </font>
   <font face="Arial, Helvetica, sans-serif" size="2">
    <a href="http://governor.alabama.gov/">
     <strong>
      Robert 
                Bentley (R)*
     </strong>
    </a>
    - Ex-Morgan County Commissioner &amp; State Correctional Officer
    <strong>
     <br/>
     <a href="http://www.facebook.com/stacy.george.3139">
      Stacy George 
                (R)
     </a>
     - Ex-Morgan County Commissioner &amp; State Correctional Officer
     <br/>
     Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
     <br/>
     <a href="http://www.bassforbama.com/">
      Kevin Bass (D)
     </a>
     - Businessman &amp; Ex-Pro Baseball Player
     <br/>
     <a href="http://www.parkergriffithforcongress.com/">
      Parker Griffith 
                (D)
     </a>
     - Ex-Congressman, Ex-State Sen., Physician &amp; Ex-Republican
    </strong>
   </font>
  </p>
 </body>
</html>
My code prints the following:

> Robert 
                Bentley (R)*
      http://governor.alabama.gov/ 

>      Stacy George 
                (R)
      http://www.facebook.com/stacy.george.3139 
     - Ex-Morgan County Commissioner & State Correctional Officer

>      Kevin Bass (D)
      http://www.bassforbama.com/ 
     - Businessman & Ex-Pro Baseball Player


>      Parker Griffith 
                (D)
      http://www.parkergriffithforcongress.com/ 
     - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
The third item is missing, namely:

Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
How can I solve this with BeautifulSoup? I tried using find_all("br") for this, but it did not work because the br tags return NoneType.

Grab all the text nodes that follow each link:

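from itertools import takewhile
from bs4 import NavigableString

# True for every node that is not an <a> or <strong> tag; text nodes have no tag name.
not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')

# `soup` is the BeautifulSoup object parsed from the HTML in the question.
for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Collect the siblings after the link, stopping at the next <a> or <strong>.
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)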
This prints:

Link contents:
Robert 
                Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer
Link contents:
Stacy George 
                (R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
Link contents:
Kevin Bass (D)- Businessman & Ex-Pro Baseball Player
Link contents:
Parker Griffith 
                (D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
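In the output above, the Bob Starkey line is appended to the Stacy George entry because it has no link of its own. If one entry per candidate is wanted, one possible approach is to walk every node in document order and start a new entry at each br. The following is only a sketch under assumptions: the helper name `split_on_br` is hypothetical, the HTML is a shortened copy of the markup in the question, and the built-in `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (assumption for this sketch).
html = """
<font>
 <a href="http://governor.alabama.gov/"><strong>Robert Bentley (R)*</strong></a>
 - Ex-Morgan County Commissioner &amp; State Correctional Officer
 <strong>
  <br/>
  <a href="http://www.facebook.com/stacy.george.3139">Stacy George (R)</a>
  - Ex-Morgan County Commissioner &amp; State Correctional Officer
  <br/>
  Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
  <br/>
  <a href="http://www.bassforbama.com/">Kevin Bass (D)</a>
  - Businessman &amp; Ex-Pro Baseball Player
 </strong>
</font>
"""


def split_on_br(container):
    """Return (text, href) pairs, one per run of content between <br> tags."""
    entries, parts, href = [], [], None
    # A trailing None acts as a sentinel so the last entry is flushed too.
    for node in list(container.descendants) + [None]:
        if node is None or getattr(node, "name", None) == "br":
            # Collapse runs of whitespace left over from the HTML indentation.
            text = " ".join("".join(parts).split())
            if text:
                entries.append((text, href))
            parts, href = [], None
        elif isinstance(node, NavigableString):
            parts.append(str(node))
        elif node.name == "a":
            href = node.get("href")
    return entries


entries = split_on_br(BeautifulSoup(html, "html.parser"))
for text, href in entries:
    print(href, "|", text)
```

Entries without a link, such as Bob Starkey, come back with `None` as the href. On the full document from the question, the "GOVERNOR:" heading would also appear as an entry of its own, since it is followed by a br.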

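I appreciate your help, and it works. As part of learning, is there another way that does not use itertools? I am new to this, so I was wondering whether there is an approach that does not require importing anything else. I am a Python beginner and have never used advanced tools like itertools; I only started learning Python a few weeks ago and am challenging myself.

@user3428883: You can use a for loop to iterate over next_siblings, and use break to end the loop when you reach a sibling you are no longer interested in.

@user3428883: That is all takewhile does; it loops over next_siblings and yields every item until the lambda function returns False, which ends the loop.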
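The for/break approach described in the comments could look roughly like the sketch below. It is an illustration rather than the answerer's exact code: the function name `text_after_links` and the `clean` helper are hypothetical, the HTML is a shortened copy of the markup in the question, and pieces are joined with a space instead of being concatenated directly.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (assumption for this sketch).
html = """
<font>
 <a href="http://governor.alabama.gov/"><strong>Robert Bentley (R)*</strong></a>
 - Ex-Morgan County Commissioner &amp; State Correctional Officer
 <strong>
  <br/>
  <a href="http://www.facebook.com/stacy.george.3139">Stacy George (R)</a>
  - Ex-Morgan County Commissioner &amp; State Correctional Officer
  <br/>
  Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
  <br/>
  <a href="http://www.bassforbama.com/">Kevin Bass (D)</a>
  - Businessman &amp; Ex-Pro Baseball Player
 </strong>
</font>
"""


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def text_after_links(soup):
    """For each link, return its text plus the sibling text that follows it."""
    results = []
    for link in soup.find_all("a"):
        parts = [clean(link.get_text())]
        for sibling in link.next_siblings:
            # Stop at the next link or <strong>; this replaces takewhile.
            if getattr(sibling, "name", None) in ("a", "strong"):
                break
            if isinstance(sibling, NavigableString):
                piece = clean(str(sibling))
            else:
                piece = clean(sibling.get_text())
            if piece:
                parts.append(piece)
        results.append(" ".join(parts))
    return results


results = text_after_links(BeautifulSoup(html, "html.parser"))
for line in results:
    print("Link contents:")
    print(line)
```

As with the takewhile version, the Bob Starkey text ends up attached to the Stacy George entry, because the loop only breaks at the next a or strong tag and a br does not stop it.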