Python Beautiful Soup: fetching table data from Wikipedia

I am reading the book Practical Web Scraping for Data Science: Best Practices and Examples with Python by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch data from Wikipedia, from the page listing the Game of Thrones episodes:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)
But when I run the code, the following error appears:

{'No.overall': '1'}

IndexError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

<ipython-input-...> in <dictcomp>(.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range
Can someone tell me why this is happening?

Your traceback is:

{'No.overall': '1'}
Traceback (most recent call last):
  File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
    episode_dict = {headers[i]: values[i] for i in
  File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
    episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range

The problem is not the code itself but its indentation. The third for loop should sit at the same level as the second for loop, not inside it. As pasted, the row loop runs while headers still holds only the first column name, so headers[i] goes out of range as soon as a row yields a second cell. This is how it is written in the book:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i]
                            for i in range(len(values))}
            episodes.append(episode_dict)

# Show the results
for episode in episodes:
    print(episode)
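
As a small aside (not from the book): pairing the headers and values with zip() instead of indexing headers[i] is slightly more forgiving, because zip stops at the shorter of the two sequences, so a mismatch between header and cell counts cannot raise an IndexError. A compact sketch of the same scrape:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')

episodes = []
for table in html_soup.find_all('table', class_='wikiepisodetable'):
    # The first row holds the column names
    headers = [th.text for th in table.find('tr').find_all('th')]
    for row in table.find_all('tr')[1:]:
        values = [col.text for col in row.find_all(['th', 'td'])]
        if values:
            # zip() pairs each header with its value and stops at the shorter list
            episodes.append(dict(zip(headers, values)))

for episode in episodes:
    print(episode)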

Thank you very much, karlcow, I will dig more into the class names for further practice. About what the code tries to extract, the book says, and I quote: "Now, let's try to tackle the following use case. You'll note that our Game of Thrones Wikipedia page has a number of well-maintained tables listing the directors, writers, air dates, and viewer numbers for the episodes. Let's try to fetch all of this data." I see your answer effectively fetches the episode titles; I will try to get all the data the book proposes.

Which data structure were you trying to build?

It is a list, but as you mentioned, this was just code that had not been copied correctly. I usually type the code in while learning, rather than copy-pasting it, to get more comfortable with the syntax. This time I will not forget what a wrong indentation can lead to.

Aha. The original code simply was not copied well. OK. :)

Great, Ananth, you are right, I will be more careful with the indentation. I am new to Python, and the book itself says indentation is one of the most common beginner mistakes. This error made me double-check the indentation before looking for other answers, and I think I learned more this way. Thanks for the help, Ananth, and also @karlcow; both answers helped me work out and better understand what happened.
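
For reference, this is how an episode title cell is marked up inside those tables (a td cell whose class is "summary"):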
<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
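
A minimal sketch (not from the book, and assuming the summary class is only used on these title cells in this page revision) that collects just the episode titles from that markup:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Each title sits in a td cell with class "summary" and is wrapped in
# quotation marks, e.g. "Stormborn"; strip the surrounding quotes.
titles = [cell.text.strip().strip('"')
          for cell in html_soup.find_all('td', class_='summary')]
print(titles[:5])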