Extracting the text between link tags with BeautifulSoup in Python
My HTML looks like this:
<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
Any insight would be greatly appreciated.

Can't you just use link.text instead of link.contents? For example:
text = """
<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>
"""
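from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(text, "html.parser")
for link in soup.find_all('a', id='mylink'):
    # .text joins every text node inside the tag and skips the markup
    link_text = link.text
    print(link_text)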
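To make the difference between link.contents and link.text concrete, here is a minimal sketch. It assumes bs4 is installed and uses a shortened copy of the anchor from the question (the img style attribute is dropped); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png"/> Nationwide - '
    '<span title="college students/staff of schools in valid states">'
    'Restrictions</span> - Ends 6/30/15</a>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", id="mylink")

# .contents is a list that mixes plain strings with Tag objects
for child in link.contents:
    print(type(child).__name__, repr(child))

# get_text() (the method behind .text) keeps only the text nodes;
# split() + join() collapses the extra space left around the <img> tag
clean_text = " ".join(link.get_text().split())
print(clean_text)
```

The whitespace clean-up on the last lines is optional; without it the text keeps a double space where the image used to be.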
Alternatively, you can use a regular expression to find all of the text:
import re

content = r'<a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a>'
# Non-greedy match: capture whatever sits between a closing '>' and the next '<'
links = re.findall(r'>(.*?)<', content)
a = ""
for link in links:
    a = a + link
print(a)
Output:
EZSTORAGE - PACK IT. STORE IT. WIN - Nationwide - Restrictions - Ends 6/30/15
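One caveat with the regex approach: it treats every '>' as the end of a tag, so an attribute value containing '>' would throw it off. If installing BeautifulSoup is not an option, a rough alternative is the standard library's html.parser. The sketch below is only an illustration, not part of the answers above; the TextCollector class and its target_id parameter are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside <a id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the target <a>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # <a> elements cannot nest, so any </a> closes the target link
        if tag == "a":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


content = (
    '<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
    'EZSTORAGE - PACK IT. STORE IT. WIN - '
    '<img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/>'
    ' Nationwide - <span title="college students/staff of schools in valid states">'
    'Restrictions</span> - Ends 6/30/15</a>'
)

collector = TextCollector("mylink")
collector.feed(content)
collector.close()

# Join the text nodes and collapse runs of whitespace
link_text = " ".join("".join(collector.chunks).split())
print(link_text)
```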