Extracting and formatting site data in Python

python html web-scraping

This is for Python 3.5.x. I am trying to find the headline that follows a piece of HTML code that looks like this:

<h3 class = "title-link__title"><span class="title=link__text">News Here</span>

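import urllib.request

# Download the page; r.read() returns the raw bytes of the response
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()
    # Turn the bytes into a list of one-character strings
    HTML = list(HTML)
    for i in range(len(HTML)):
        HTML[i] = chr(HTML[i])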

How can I extract only the headlines, since that is all I need? I will do my best to provide more details if that helps.

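Getting information from a web page is called web scraping.

The BeautifulSoup library is one of the best tools for this job.

Have you tried using regular expressions? Also, you may want to state explicitly what you want the program to extract from the HTML above.

Thanks, but I have already implemented it using BeautifulSoup, and I am looking for headlines that change frequently.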
from bs4 import BeautifulSoup
import urllib.request

# opening page (urllib.urlopen is Python 2 only; Python 3 uses urllib.request)
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()
# creating soup (naming the parser explicitly avoids a bs4 warning)
soup = BeautifulSoup(r, 'html.parser')

# useful for understanding the layout of your page info
# print(soup.prettify())

# creating a ResultSet with all h3 tags that have the class 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# counting occurrences
len(a)
# result = 44

# get text of first header
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# get text of second header
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
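
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with only the standard library. This is a different technique from the answer above: it uses `html.parser.HTMLParser` rather than bs4. The names `HeadlineParser`, `extract_headlines` and `SAMPLE_HTML` are made up for illustration, and the sample markup is a hand-written stand-in for the real page, whose layout may differ or may have changed since.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3> whose class list contains target_class."""

    def __init__(self, target_class="title-link__title"):
        super().__init__()
        self.target_class = target_class
        self.headlines = []    # finished headlines, whitespace-stripped
        self._in_h3 = False    # True while inside a matching <h3>
        self._chunks = []      # text pieces of the headline being read

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # An attribute value may be None, so fall back to an empty string
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self._in_h3 = True
                self._chunks = []

    def handle_data(self, data):
        if self._in_h3:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.headlines.append("".join(self._chunks).strip())
            self._in_h3 = False


def extract_headlines(raw_bytes, encoding="utf-8"):
    """Decode the downloaded bytes in one step and return the headline strings."""
    parser = HeadlineParser()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return parser.headlines


# Hypothetical sample markup, standing in for the bytes returned by r.read()
SAMPLE_HTML = b"""
<div>
  <h3 class="title-link__title">
    <span>May v Leadsom to be next UK PM</span>
  </h3>
  <h3 class="other">Not a headline</h3>
  <h3 class="title-link__title">
    <span>Video shows US police shooting aftermath</span>
  </h3>
</div>
"""

headlines = extract_headlines(SAMPLE_HTML)
for headline in headlines:
    print(headline)
```

With a live page, the bytes from `urllib.request.urlopen(...).read()` would be passed to `extract_headlines` in place of `SAMPLE_HTML`. Decoding the whole response with `bytes.decode` also avoids the per-byte `chr` loop from the question, which would split multi-byte UTF-8 characters into several wrong characters.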