How to extract text with Python, including links, the text after each link, and the text after a br
I have parsed the string below to extract data from it, but I cannot extract part of the data. I have tried several different approaches. So far I have managed to get the text between the tags, the links, and the text outside each link.
<html>
<body>
<p align="left">
<font face="Arial, Helvetica, sans-serif" size="2">
<b>
<font size="4">
GOVERNOR:
</font>
</b>
<br/>
</font>
<font face="Arial, Helvetica, sans-serif" size="2">
<a href="http://governor.alabama.gov/">
<strong>
Robert
Bentley (R)*
</strong>
</a>
- Ex-Morgan County Commissioner & State Correctional Officer
<strong>
<br/>
<a href="http://www.facebook.com/stacy.george.3139">
Stacy George
(R)
</a>
- Ex-Morgan County Commissioner & State Correctional Officer
<br/>
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
<br/>
<a href="http://www.bassforbama.com/">
Kevin Bass (D)
</a>
- Businessman & Ex-Pro Baseball Player
<br/>
<a href="http://www.parkergriffithforcongress.com/">
Parker Griffith
(D)
</a>
- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
</strong>
</font>
</p>
</body>
</html>
My current code prints the following:
> Robert
Bentley (R)*
http://governor.alabama.gov/
> Stacy George
(R)
http://www.facebook.com/stacy.george.3139
- Ex-Morgan County Commissioner & State Correctional Officer
> Kevin Bass (D)
http://www.bassforbama.com/
- Businessman & Ex-Pro Baseball Player
> Parker Griffith
(D)
http://www.parkergriffithforcongress.com/
- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
The third item is missing, namely:
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
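How can I solve this with BeautifulSoup? I tried using find_all("br") for this, but it did not work because the br tags return NoneType.

Grab all the text nodes that follow each link, outside the link itself: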
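from itertools import takewhile
from bs4 import NavigableString

# `soup` is the BeautifulSoup object built from the HTML above.
# True for any node that is not an <a> or <strong> tag; the default of None
# covers nodes that carry no tag name.
not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')

for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Collect the following siblings until the next <a> or <strong>.
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)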
This prints:
Link contents:
Robert
Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer
Link contents:
Stacy George
(R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
Link contents:
Kevin Bass (D)- Businessman & Ex-Pro Baseball Player
Link contents:
Parker Griffith
(D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
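In the output above, the Bob Starkey line is appended to the Stacy George entry because it has no link of its own. If one record per line is wanted instead, one possible approach is to walk every node and start a new record at each <br/>. The following is only a sketch under assumptions: the sample markup, the names and example.com URLs, and the helper records_by_br are all hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical, shortened sample modelled on the markup in the question.
HTML = """<p><a href="http://example.com/a">Alice (R)</a> - Farmer<br/>
Bob (R) - Teacher<br/>
<a href="http://example.com/c">Carol (D)</a> - Engineer</p>"""


def records_by_br(container):
    """Split the container's content into (text, href) records at each <br/>."""
    records = []
    pieces, href = [], None
    for node in container.descendants:
        if isinstance(node, Tag):
            if node.name == "br":
                # A <br/> closes the current record, if it has any text.
                if pieces:
                    records.append((" ".join(pieces), href))
                pieces, href = [], None
            elif node.name == "a":
                # Remember the link target for the current record.
                href = node.get("href")
        elif isinstance(node, NavigableString):
            # Collapse internal whitespace and skip whitespace-only nodes.
            piece = " ".join(str(node).split())
            if piece:
                pieces.append(piece)
    # Flush the final record, which has no trailing <br/>.
    if pieces:
        records.append((" ".join(pieces), href))
    return records


soup = BeautifulSoup(HTML, "html.parser")
records = records_by_br(soup.p)
for text, href in records:
    print(text, href)
```

Entries without a link, such as the second one in this sample, come back with None as the href.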
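I appreciate your help, and it works. As part of learning, is there another way to do this that does not use itertools? I am a Python beginner who only started learning a few weeks ago as a challenge to myself, and I have never used advanced tools like itertools, so I wonder whether this can be done without importing anything else.

@user3428883: You can use a for loop over next_siblings and use break to end the loop when you reach a sibling you are no longer interested in.

@user3428883: That is all takewhile does: it loops over next_siblings and yields each item until the lambda function returns False, which ends the loop.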
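The suggestion in the comments, a plain for loop with break in place of takewhile, might look like the sketch below. The sample markup, the names and example.com URLs, and the function name text_after_links are assumptions made for illustration; only bs4 itself is imported.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, shortened sample modelled on the markup in the question.
HTML = """<p><a href="http://example.com/a">Alice (R)</a> - Farmer<br/>
Bob (R) - Teacher<br/>
<a href="http://example.com/c">Carol (D)</a> - Engineer</p>"""


def text_after_links(soup):
    """Return one string per <a>: the link text plus the text that follows it."""
    results = []
    for link in soup.find_all("a"):
        parts = [link.get_text(strip=True)]
        for sibling in link.next_siblings:
            # Stop at the next <a> or <strong>, as the takewhile version does.
            if getattr(sibling, "name", None) in ("a", "strong"):
                break
            if isinstance(sibling, NavigableString):
                piece = str(sibling).strip()
            else:
                piece = sibling.get_text(strip=True)
            # Skip empty pieces such as <br/> tags and bare newlines.
            if piece:
                parts.append(piece)
        results.append(" ".join(parts))
    return results


soup = BeautifulSoup(HTML, "html.parser")
results = text_after_links(soup)
for line in results:
    print(line)
```

Unlike the answer's code, this sketch joins the pieces with a space, so the link text and the text after it do not run together.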