Extracting and formatting site data in Python
This is for Python 3.5.x. I am trying to find the headline text that appears after a piece of HTML like this:
<h3 class = "title-link__title"><span class="title=link__text">News Here</span>
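import urllib.request
# download the page; r.read() returns the raw bytes
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
# turn the bytes into a list of one-character strings
HTML = list(HTML)
for i in range(len(HTML)):
    HTML[i] = chr(HTML[i])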
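How can I extract only the headlines? They are all I need. I will do my best to provide any further details.
Getting information out of a web page is called web scraping. The BeautifulSoup library is one of the best tools for this job.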
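Have you tried using regular expressions? Also, you may want to state explicitly what you want the program to extract from the HTML above.
Thanks, but I have now implemented it with BeautifulSoup. I am after the headlines, which change frequently.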
from bs4 import BeautifulSoup
import urllib.request
# open the page and read the raw HTML (urllib.request is the Python 3 module)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# create the soup; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(r, 'html.parser')
# useful for understanding the layout of the page
# print(soup.prettify())
# build a ResultSet of all h3 tags whose class is 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})
# count the occurrences
len(a)
# result when the answer was written: 44
# get the text of the first header
a[0].text
# result: '\nMay v Leadsom to be next UK PM\n'
# get the text of the second header
a[1].text
# result: '\nVideo shows US police shooting aftermath\n'
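The answer above relies on the third-party BeautifulSoup package. If installing it is not an option, roughly the same extraction can be sketched with only the standard library's html.parser module. The names HeadlineParser and extract_headlines and the sample markup below are illustrative assumptions, not part of the original answer, and the sketch assumes the headlines sit in h3 tags carrying the class title-link__title, as in the question.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h3> whose class list includes target_class.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, target_class="title-link__title"):
        super().__init__()
        self.target_class = target_class
        self.headlines = []    # finished headline strings
        self._inside = False   # True while inside a matching <h3>
        self._parts = []       # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # attrs is a list of (name, value) pairs; value may be None
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self._inside = True
                self._parts = []

    def handle_data(self, data):
        # Text inside nested tags (such as the <span>) also arrives here
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._inside:
            text = "".join(self._parts).strip()
            if text:
                self.headlines.append(text)
            self._inside = False


def extract_headlines(html_text):
    """Return the stripped text of all matching headlines in html_text."""
    parser = HeadlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.headlines


# Made-up sample markup modelled on the snippet in the question
sample = """
<h3 class="title-link__title"><span class="title-link__text">
May v Leadsom to be next UK PM
</span></h3>
<h3 class="other">Not a headline</h3>
<h3 class="title-link__title"><span class="title-link__text">
Video shows US police shooting aftermath
</span></h3>
"""

print(extract_headlines(sample))
```

To run it against the live page, pass in the decoded response, for example urllib.request.urlopen('http://www.bbc.co.uk/news').read().decode('utf-8'), assuming the page is UTF-8 encoded. The site's markup may have changed since the question was asked, so the class name may need adjusting.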