Python dict from an HTML table


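I am trying to convert an HTML table into a Python dict using BeautifulSoup. Because the table has multiple levels, I can't get the information saved correctly.

Here is what I have tried: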

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt8579674/awards'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')

award_list = []

for table in html_soup.find_all('table', {'class': 'awards'}):
    for tr in table.find_all('tr'):
        for title_award_outcome in tr.find_all('td', {'class': 'title_award_outcome'}):
            award_name = title_award_outcome.get_text(separator='<br/>', 
                                                      strip=True).split('<br/>', 1)[1]            

        for award_description in tr.find_all('td', {'class': 'award_description'}):
            award_description = award_description.get_text(separator='<br/>', 
                                                           strip=True).split('<br/>', 1)[0]
            award = award_name+'_'+award_description

        for title_award_outcome in tr.find_all('td', {'class': 'title_award_outcome'}):
            result = title_award_outcome.get_text(separator='<br/>', strip=True).split('<br/>', 1)[0]

            award_dict[award] = result
            award_list.append(award_dict)

print(award_list)
[{'Golden Globe_Best Motion Picture - Drama': 'Winner', 
  'Golden Globe_Best Original Score - Motion Picture': 'Nominee', 
  'Golden Globe_Best Original Score - Motion Picture': 'Nominee', 
  'BAFTA Film Award_Best Director': 'Nominee',
  'BAFTA Film Award_Outstanding British Film of the Year': 'Nominee',
   etc, etc, etc}]

To create the desired dictionary, you can use this example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt8579674/awards'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = {}
for td in soup.select('.awards td'):
    outcome, cat = td.select_one('.title_award_outcome b'), td.select_one('.award_category')
    if outcome and cat:
        current = []
        out[(outcome.get_text(strip=True), cat.get_text(strip=True))] = current
    else:
        for a in td.select('a'):
            a.extract()
        current.append(td.contents[0].strip())

# transform the dict to desired structure:
out2 = {}
for (outcome, award), v in out.items():
    for i in v:
        out2['{}_{}'.format(award, i)] = outcome

# print it
from pprint import pprint
pprint(out2)
Prints:

{'AACTA International Award_Best Direction': 'Nominee',
 'AFCA Award_Best Cinematography': 'Winner',
 'AFCA Award_Best Film Editing': 'Nominee',
 'AFCA Award_Best Score': 'Winner',
 'AFCC Award_Best Cinematography': 'Winner',
 'AFCC Award_Best Original Score': 'Winner',
 'AFCC Award_Top Ten Films': 'Nominee',
 'AFI Award_Movie of the Year': 'Winner',
 'ALFS Award_British/Irish Actor of the Year': 'Nominee',
 'ALFS Award_British/Irish Film of the Year': 'Nominee',
 'ALFS Award_Director of the Year': 'Nominee',
 'ALFS Award_Film of the Year': 'Nominee',

...and so on.
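
The loop in the answer appears to rely on one idea: a cell that carries both an outcome and an award name starts a new group, and each following description cell is attached to that group. Below is a minimal sketch of that pattern on hand-written data, so it runs without a network request. The `cells` list and the `group_cells` helper are hypothetical stand-ins for the parsed `td` elements and are not part of the answer.

```python
def group_cells(cells):
    """Attach each description cell to the most recent header cell."""
    groups = {}
    current = None  # list being filled; None until the first header is seen
    for kind, value in cells:
        if kind == 'header':
            # value is an (outcome, award) pair, like the answer's dict key
            current = groups.setdefault(value, [])
        elif current is not None:
            current.append(value)
    return groups


# Hypothetical cell sequence, in document order
cells = [
    ('header', ('Winner', 'AFCA Award')),
    ('description', 'Best Cinematography'),
    ('description', 'Best Score'),
    ('header', ('Nominee', 'AFCA Award')),
    ('description', 'Best Film Editing'),
]

groups = group_cells(cells)

# Flatten to the 'Award_Description' -> outcome shape used in the answer
flat = {
    '{}_{}'.format(award, desc): outcome
    for (outcome, award), descs in groups.items()
    for desc in descs
}
print(flat)
```

Using `setdefault` keeps earlier entries if the same (outcome, award) pair appears twice, which differs slightly from the plain assignment in the answer. Starting `current` at `None` also avoids a `NameError` if a description cell were to appear before any header cell.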

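The output seems a bit unusual. It's a list with a single dict inside? Keys like "Golden Globe_Best Motion Picture" seem odd. Why not make a nested dict such as {"Golden Globe": {"Best Motion Picture": {"Drama": "Winner"}}}, etc.? This matches the nesting of the original data and seems more useful for fast lookups (the proposed structure seems to require a linear search for any lookup). award_dict isn't defined in the code.

Yes, @ggorlen. I have to add this list, along with other information from the movie, to another dict.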
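
The nested layout suggested in the comment can be derived from the flat dict after the fact. The sketch below is one possible way to do it, not the answer's method. The `nest_awards` helper is a hypothetical name, the sample data is copied from the printed output above, and the code assumes award names never contain an underscore. It stops at two levels (award, then category) and does not split categories further on " - ".

```python
from collections import defaultdict

# Sample entries copied from the flat output printed by the answer
flat = {
    'AFCA Award_Best Cinematography': 'Winner',
    'AFCA Award_Best Film Editing': 'Nominee',
    'AFI Award_Movie of the Year': 'Winner',
}


def nest_awards(flat_awards):
    """Turn 'Award_Category' keys into {award: {category: outcome}}."""
    nested = defaultdict(dict)
    for key, outcome in flat_awards.items():
        # Split on the first underscore only (assumption: none in award names)
        award, category = key.split('_', 1)
        nested[award][category] = outcome
    return dict(nested)


nested = nest_awards(flat)
print(nested['AFCA Award'])
```

With this shape, all results for one award are a single key lookup instead of a scan over every flat key.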