Parsing an HTML file in Python
I have a very long HTML file that looks just like this -. I would like to parse the file so that I get the information in the form of tuples. For example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
My current attempt just strips all the tags from the HTML file and gives something like this:
Cech
Chelsea
30
£6.4
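I could write a crude piece of code that reads each line and assigns it to a tuple, but I cannot come up with any solution that also includes the player's position (the string that appears in the <h2> tags). Any solutions or suggestions would be much appreciated.
The reason I am leaning towards tuples is that I can use unpacking, and I plan to populate a MySQL table with the unpacked values.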
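The unpacking plan described above could look roughly like the sketch below. It is only an illustration under several assumptions: the standard-library sqlite3 module stands in for MySQL so the example runs without a server, the table and column names (players, position, name, club, points, price) are hypothetical, and the two sample rows are hand-written in the desired tuple shape.

```python
import sqlite3

# Hypothetical sample rows in the desired (position, name, club, points, price) shape
rows = [
    ("Goalkeepers", "Cech", "Chelsea", "30", "£6.4"),
    ("Defenders", "Baines", "Everton", "43", "£7.7"),
]

# In-memory SQLite database used as a stand-in for a MySQL connection
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players "
    "(position TEXT, name TEXT, club TEXT, points INTEGER, price REAL)"
)

for row in rows:
    # Tuple unpacking: one name per field
    position, name, club, points, price = row
    conn.execute(
        "INSERT INTO players VALUES (?, ?, ?, ?, ?)",
        # Convert the scraped strings to numbers before storing them
        (position, name, club, int(points), float(price.lstrip("£"))),
    )
conn.commit()

stored = conn.execute(
    "SELECT name, points, price FROM players ORDER BY name"
).fetchall()
print(stored)
```

With a real MySQL driver the overall shape should be similar, although the common drivers expect %s placeholders instead of ?.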
from bs4 import BeautifulSoup
from pprint import pprint

# `html` is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")  # get all h2 elements
tables = soup.select("table")  # get all tables

first = True
title = ""
players = []
for i, table in enumerate(tables):
    if first:
        # every h2 element has 2 tables: 8 tables, 4 h2 elements,
        # so there is 1 h2 for every 2 tables
        title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start a player tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # add the td text to the player
        if len(player) > 1:
            # only add rows that contain player data, not just ("Goalkeepers",)
            players.append(player)
    first = not first
pprint(players)
Output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
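The answer relies on every h2 being followed by exactly two tables. As a hedged alternative, the sketch below remembers the most recent h2 heading while walking the document and attaches it to each row, so it does not depend on that count. It uses the standard-library html.parser module rather than BeautifulSoup; the class name PlayerTableParser and the sample markup are hypothetical and only mimic the structure described in the question.

```python
from html.parser import HTMLParser


class PlayerTableParser(HTMLParser):
    """Collect (heading, cell, cell, ...) tuples from <h2> + <table> markup."""

    def __init__(self):
        super().__init__()
        self.players = []     # finished tuples
        self.heading = ""     # text of the most recent <h2>
        self._in_h2 = False
        self._in_td = False
        self._row = None      # cells of the <tr> being read, or None
        self._text = []       # text fragments of the current <h2> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2, self._text = True, []
        elif tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td, self._text = True, []

    def handle_data(self, data):
        # Only keep text that sits inside a heading or a table cell
        if self._in_h2 or self._in_td:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2":
            self.heading = "".join(self._text).strip()
            self._in_h2 = False
        elif tag == "td":
            if self._row is not None:
                self._row.append("".join(self._text).strip())
            self._in_td = False
        elif tag == "tr":
            # Rows without any <td> cells (for example header rows) are skipped
            if self._row:
                self.players.append((self.heading, *self._row))
            self._row = None


# Hypothetical sample markup
sample = """
<h2>Goalkeepers</h2>
<table><tr><th>Player</th></tr>
<tr><td>Cech</td><td>Chelsea</td><td>30</td><td>£6.4</td></tr></table>
<h2>Defenders</h2>
<table><tr><td>Baines</td><td>Everton</td><td>43</td><td>£7.7</td></tr></table>
"""

parser = PlayerTableParser()
parser.feed(sample)
parser.close()
print(parser.players)
```

Whether this is preferable to the BeautifulSoup version depends on the real page; it trades a third-party dependency for a little more bookkeeping.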
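I'm not sure what the code above is meant to do. The code I posted in the question using the nltk module does exactly the same thing as yours. In fact, your code even strips out the Defenders, Midfielders and Forwards labels completely, and I need those in my output.
I think this is what you want? If it is confusing, let me know and I'll add comments.
That looks like exactly what I need. However, my output looks like this: (u'Goalkeepers', u'Cech', u'Chelsea', u'30', u'\xa36.4'). Is that because I'm missing a Python plugin?
I did this in Python 3.3. Python 2.x handles strings/unicode differently; the u stands for unicode.
I assume you can see by now, based on the answers, that nltk is the wrong tool for this job.
I tried to use nltk because I was having difficulty with it. It looked simple, but it gave me a recursion error. It took a while to figure out what the problem was.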
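To illustrate the point about the u prefix, here is a small sketch assuming Python 3.3 or later, where the u prefix is accepted again. The row is copied from the comment above; the variable names are only for illustration. The escape \xa3 is the pound sign, so the data is intact and only its repr looks different under Python 2.

```python
# Row as it appears in the Python 2 output quoted above
row = (u'Goalkeepers', u'Cech', u'Chelsea', u'30', u'\xa36.4')

price = row[-1]
print(repr(price))  # how the string is displayed inside a tuple or list
print(price)        # printing the string itself shows the pound sign

# '\xa3' is the escape for the pound sign, so both spellings are equal
same = price == '£6.4'

# Strip the currency symbol to get a number for the database
value = float(price.lstrip(u'\xa3'))
print(same, value)
```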