Extracting and formatting site data in Python

This is for Python 3.5.x. I want to find the headline that appears inside a piece of HTML like this:

<h3 class = "title-link__title"><span class="title=link__text">News Here</span>

import urllib.request

with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
    HTML = list(HTML)
    for i in range(len(HTML)):
        HTML[i] = chr(HTML[i])

How can I extract only the headlines, since that is all I need? I will do my best to provide any further details.
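A side note on the loop in the question: calling chr() on each byte only works for plain ASCII pages. The bytes returned by read() can usually be turned into text in one step with decode(). The sketch below is a minimal illustration, assuming the page is UTF-8 encoded; the raw variable is a hypothetical stand-in for the downloaded bytes, not the real BBC page.

```python
# Hypothetical stand-in for the bytes returned by r.read()
raw = "<h3>Café</h3>".encode("utf-8")

# Byte-by-byte conversion, as in the question: multi-byte characters get mangled
per_byte = "".join(chr(b) for b in raw)

# Decoding the whole byte string at once keeps non-ASCII characters intact
decoded = raw.decode("utf-8")

print(per_byte)
print(decoded)
```

Either way the result is still one long string, so a parser is needed to pick out the headlines.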

Getting information from a web page is called web scraping.

The BeautifulSoup library is one of the best tools for this job.


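Have you tried using regular expressions? Also, you may want to state explicitly what you want the program to extract from the HTML above.
Thanks, but I have already done it with BeautifulSoup, and I am looking for headlines that change frequently.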
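from bs4 import BeautifulSoup
import urllib.request

# Open the page and read the raw HTML (urlopen lives in urllib.request on Python 3)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# Create the soup, naming the parser explicitly
soup = BeautifulSoup(r, 'html.parser')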

# Useful for understanding the layout of the page
# print(soup.prettify())

# Build a ResultSet of all h3 tags whose class is 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# Count the occurrences
len(a)
# result = 44

# Get the text of the first header
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# Get the text of the second header
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
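
Since the goal is to get every headline rather than one at a time, the same lookup can feed a list. The following is a rough sketch only: sample_html is a hypothetical snippet standing in for the downloaded page so the example runs without a network connection, and the real page markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
sample_html = """
<h3 class="title-link__title"><span>
First headline
</span></h3>
<h3 class="title-link__title"><span>
Second headline
</span></h3>
<h3 class="other">Not a headline</h3>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) removes the surrounding newlines seen in the .text results
headlines = [
    h3.get_text(strip=True)
    for h3 in soup.find_all("h3", class_="title-link__title")
]

for number, headline in enumerate(headlines, start=1):
    print(number, headline)
```

Because the list is rebuilt on every run, headlines that change frequently are picked up as long as the class name stays the same.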