Scraping author names from a website with try/except in Python

python python-3.x web-scraping

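I am trying to use try/except to walk through the different pages of a URL that contains author data. I need the set of author names from 10 consecutive pages of this website.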

# Import Packages
import requests
import bs4
from bs4 import BeautifulSoup as bs
# Output list
authors = [] 
# Website Main Page URL
URL = 'http://quotes.toscrape.com/'
res = requests.get(URL)
soup = bs4.BeautifulSoup(res.text,"lxml")
# Get the contents from the first page
for item in soup.select(".author"):
    authors.append(item.text)
page = 1
pagesearch = True
# Get the contents from 2-10 pages
while pagesearch:
    # Check if page is available
    try:
            req = requests.get(URL + '/' + 'page/' + str(page) + '/')
            soup = bs(req.text, 'html.parser')
            page = page + 1
            for item in soup.select(".author"): # Append the author class from the webpage html
                authors.append(item.text)  
    except:
        print("Page not found")
        pagesearch == False
        break # Break if no page is remaining

print(set(authors)) # Print the output as a unique set of author names

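The URL of the first page does not contain a page number, so I handle it separately. I use a try/except block to iterate over all possible pages, expecting an exception to be raised once the last page has been scanned so that the loop breaks.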

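When I run the program it goes into an infinite loop, whereas it should print the "Page not found" message once the pages run out. When I interrupt the kernel, I see the correct result as a list together with my exception message, but nothing before that. I get the following result: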

Page not found
{'Allen Saunders', 'J.K. Rowling', 'Pablo Neruda', 'J.R.R. Tolkien', 'Harper Lee', 'J.M. Barrie', 
 'Thomas A. Edison', 'J.D. Salinger', 'Jorge Luis Borges', 'Haruki Murakami', 'Dr. Seuss', 'George 
  Carlin', 'Alexandre Dumas fils', 'Terry Pratchett', 'C.S. Lewis', 'Ralph Waldo Emerson', 'Jim 
  Henson', 'Suzanne Collins', 'Jane Austen', 'E.E. Cummings', 'Jimi Hendrix', 'Khaled Hosseini', 
 'George Eliot', 'Eleanor Roosevelt', 'André Gide', 'Stephenie Meyer', 'Ayn Rand', 'Friedrich 
  Nietzsche', 'Mother Teresa', 'James Baldwin', 'W.C. Fields', "Madeleine L'Engle", 'William 
  Nicholson', 'George R.R. Martin', 'Marilyn Monroe', 'Albert Einstein', 'George Bernard Shaw', 
 'Ernest Hemingway', 'Steve Martin', 'Martin Luther King Jr.', 'Helen Keller', 'Charles M. Schulz', 
 'Charles Bukowski', 'Alfred Tennyson', 'John Lennon', 'Garrison Keillor', 'Bob Marley', 'Mark 
  Twain', 'Elie Wiesel', 'Douglas Adams'}

What is the reason for this? Thanks.

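I think that is because a page really does exist there. An exception might be raised if there were no page to display at all. But when you request this URL: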

http://quotes.toscrape.com/page/11/

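the browser still displays a page, and bs4 can still parse that page for elements, so no exception is raised and the loop never ends.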
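To illustrate the point, here is a minimal sketch of what happens when bs4 parses a page that contains no author elements. The `EMPTY_PAGE` string and the `extract_authors` helper are hypothetical stand-ins, not the site's actual markup; the sketch only shows that an empty selection is returned rather than an exception being raised.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page past the last one: valid HTML, no quotes.
EMPTY_PAGE = "<html><body><div>No quotes found!</div></body></html>"


def extract_authors(html):
    """Return the text of every element with class "author" in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(".author")]


# select() simply finds nothing; the try/except in the question never fires.
print(extract_authors(EMPTY_PAGE))
```

So one possible stop condition is an empty result for a page, instead of waiting for an exception.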

How do you stop at page 11? You can check whether the "Next" button is present on the page.
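The "Next" button idea could be sketched roughly as follows. It assumes the button is rendered as a link inside an `li` element with class `next`; that selector is an assumption about the site's markup and should be checked against the real page. The names `fetch_html`, `collect_authors`, and `max_pages` are made up for this illustration.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML; raises on HTTP error statuses."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def collect_authors(start_url, fetch=fetch_html, max_pages=100):
    """Follow "Next" links from start_url and return the set of author names."""
    authors = set()
    url = start_url
    for _ in range(max_pages):  # safety cap so a link cycle cannot loop forever
        soup = BeautifulSoup(fetch(url), "html.parser")
        authors.update(tag.get_text(strip=True) for tag in soup.select(".author"))
        # Assumed markup: <li class="next"><a href="/page/2/">Next</a></li>
        next_link = soup.select_one("li.next > a[href]")
        if next_link is None:
            break  # no "Next" button, so this was the last page
        url = urljoin(url, next_link["href"])
    return authors


# Example (performs real HTTP requests):
# print(collect_authors("http://quotes.toscrape.com/"))
```

Because the loop follows whatever link the page offers, it does not hard-code the number of pages and should keep working if pages are added or removed, as long as the assumed markup stays the same.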

Thanks for reading.

Try using the built-in range() function to go through pages 1-10 instead:

import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/page/{}/"
authors = []

for page in range(1, 11):
    response = requests.get(url.format(page))
    print("Requesting Page: {}".format(response.url))
    soup = BeautifulSoup(response.content, "html.parser")
    for tag in soup.select(".author"):
        authors.append(tag.text)

print(set(authors))

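Yes, this approach will work; however, I am looking for a solution that also holds up if the website changes. Thanks for the answer.
Yes, this is a good approach. Thanks for your answer.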