Python Beautiful Soup: fetching table data from Wikipedia

I am reading the book Practical Web Scraping for Data Science: Best Practices and Examples with Python by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch data from Wikipedia, from the page listing the Game of Thrones episodes:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)
But when I run the code, the following error appears:

{'No.overall': '1'}

IndexError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

<ipython-input-...> in <dictcomp>(.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range
Can someone tell me why this is happening?

Your traceback is:

{'No.overall': '1'}
Traceback (most recent call last):
  File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
    episode_dict = {headers[i]: values[i] for i in
  File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
    episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range

The problem is not the code itself but its indentation. The third for loop should sit at the same level as the second for loop, not inside it. As pasted, the row loop runs while headers still holds only the first column name, so headers[i] goes out of range as soon as a row yields a second cell. This is how it is written in the book:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i]
                            for i in range(len(values))}
            episodes.append(episode_dict)

# Show the results
for episode in episodes:
    print(episode)
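
As a small aside (not from the book): pairing the headers and values with zip() instead of indexing headers[i] is slightly more forgiving, because zip stops at the shorter of the two sequences, so a mismatch between header and cell counts cannot raise an IndexError. A compact sketch of the same scrape:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')

episodes = []
for table in html_soup.find_all('table', class_='wikiepisodetable'):
    # The first row holds the column names
    headers = [th.text for th in table.find('tr').find_all('th')]
    for row in table.find_all('tr')[1:]:
        values = [col.text for col in row.find_all(['th', 'td'])]
        if values:
            # zip() pairs each header with its value and stops at the shorter list
            episodes.append(dict(zip(headers, values)))

for episode in episodes:
    print(episode)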

Thank you very much, karlcow, I will dig more into the class names for further practice. About what the code tries to extract, the book says, and I quote: "Now, let's try to tackle the following use case. You'll note that our Game of Thrones Wikipedia page has a number of well-maintained tables listing the directors, writers, air dates, and viewer numbers for the episodes. Let's try to fetch all of this data." I see your answer effectively fetches the episode titles; I will try to get all the data the book proposes.

Which data structure were you trying to build?

It is a list, but as you mentioned, this was just code that had not been copied correctly. I usually type the code in while learning, rather than copy-pasting it, to get more comfortable with the syntax. This time I will not forget what a wrong indentation can lead to.

Aha. The original code simply was not copied well. OK. :)

Great, Ananth, you are right, I will be more careful with the indentation. I am new to Python, and the book itself says indentation is one of the most common beginner mistakes. This error made me double-check the indentation before looking for other answers, and I think I learned more this way. Thanks for the help, Ananth, and also @karlcow; both answers helped me work out and better understand what happened.
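
For reference, this is how an episode title cell is marked up inside those tables (a td cell whose class is "summary"):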
<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
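
A minimal sketch (not from the book, and assuming the summary class is only used on these title cells in this page revision) that collects just the episode titles from that markup:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
html_soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Each title sits in a td cell with class "summary" and is wrapped in
# quotation marks, e.g. "Stormborn"; strip the surrounding quotes.
titles = [cell.text.strip().strip('"')
          for cell in html_soup.find_all('td', class_='summary')]
print(titles[:5])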