How do I get my Python code to loop to the next page correctly when scraping a website?


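I'm trying to scrape a real estate website, but I'm having trouble getting my code to move on to the next page (there are 25 pages in total). At the moment it just keeps scraping page 1 over and over. I'm a fairly big newbie, so I apologize if this is a dumb request.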

import requests
from bs4 import BeautifulSoup
from csv import writer

base_url = 'https://www.rew.ca/properties/areas/kelowna-bc'

for i in range(1,26):
    url = '/page/' + str(i)

    while url:
        response = requests.get(f"{base_url}{url}")
        soup = BeautifulSoup(response.text, "html.parser")
        listings = soup.find_all("article")

        with open("property4.csv", "w") as csv_file:
            csv_writer = writer(csv_file)
            csv_writer.writerow(["title", "type", "price", "location", "bedrooms", "bathrooms", "square feet", "link"])
        for listing in listings:
            location = listing.find(class_="displaypanel-info").get_text().strip()
            price = listing.find(class_="displaypanel-title hidden-xs").get_text().strip()
            link = listing.find("a").get('href').strip()
            title = listing.find("a").get('title').strip()
            type = (listing.find(class_="clearfix hidden-xs").find(class_="displaypanel-info")).get_text()
            bedrooms = (listing.find_all("li")[2]).get_text()
            bathrooms = (listing.find_all("li")[3]).get_text()
            square_feet = (listing.find_all("li")[4]).get_text()
            csv_writer.writerow([title, type, price, location, bedrooms, bathrooms, square_feet, link])
            next_btn = soup.find(class_="paginator-next_page paginator-control")
            url = next_btn.find("a")["href"]

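There are two problems with your loop.

Indentation

The indentation of the find() statement makes the code look up the button several times per page (once per listing), which is unnecessary.

The while loop

The while loop stops you from going from page 1 to page 2, because url is still truthy even after you have found the next page. Simply remove it.

Here is a fixed version: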

import requests
from bs4 import BeautifulSoup
from csv import writer

base_url = 'https://www.rew.ca/properties/areas/kelowna-bc'

for i in range(1,26):
    url = '/page/' + str(i)

    response = requests.get(f"{base_url}{url}")
    soup = BeautifulSoup(response.text, "html.parser")
    listings = soup.find_all("article")
    # do your csv work here
    next_btn = soup.find(class_="paginator-next_page paginator-control")
    url = next_btn.find("a")["href"]
    print(url)

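To develop the code a little further, I have split the csv logic out into a function and used a while loop instead of a for loop. The advantage is that if more listings make the pagination longer or shorter, the loop does not need to be updated.

When I tried my code, I found that the domain requires you to request no faster than one page every 5 seconds, so I added a 5-second delay between fetches.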
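
The fixed delay described above can also be written as a small reusable helper. This is only a sketch: the Throttle class and its min_interval parameter are hypothetical names, not part of requests or BeautifulSoup, and the short interval in the usage lines is there just to keep the demonstration quick. It sleeps only for the time still missing since the previous request, so slow pages are not delayed more than necessary.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage: call wait() right before every request.
throttle = Throttle(0.05)  # would be Throttle(5) for the site discussed here
for page_number in range(3):
    throttle.wait()
    # the actual page fetch would go here
```

The first call returns immediately; each later call blocks until at least min_interval seconds have passed since the previous one.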

import requests
import time
from bs4 import BeautifulSoup as soup
from csv import writer

def parse_listing(page_html, csv_writer):
  # Write one CSV row for every listing on an already-parsed page.
  listings = page_html.find_all("article")
  for listing in listings:
    location = listing.find(class_="displaypanel-info").get_text().strip()
    price = listing.find(class_="displaypanel-title hidden-xs").get_text().strip()
    link = listing.find("a").get('href').strip()
    title = listing.find("a").get('title').strip()
    type = (listing.find(class_="clearfix hidden-xs").find(class_="displaypanel-info")).get_text()
    bedrooms = (listing.find_all("li")[2]).get_text()
    bathrooms = (listing.find_all("li")[3]).get_text()
    square_feet = (listing.find_all("li")[4]).get_text()
    csv_writer.writerow([title, type, price, location, bedrooms, bathrooms, square_feet, link])

prefix = 'https://www.rew.ca'
d = soup(requests.get('https://www.rew.ca/properties/areas/kelowna-bc').text, 'html.parser')

# Open the file once, outside the loop, so later pages do not overwrite earlier ones.
with open("property4.csv", "w", newline="") as csv_file:
  csv_writer = writer(csv_file)
  csv_writer.writerow(["title", "type", "price", "location", "bedrooms", "bathrooms", "square feet", "link"])

  while True:
    parse_listing(d, csv_writer)
    next_page = d.find('a', {'rel': 'next'})
    if next_page:
      href_link = next_page.get('href')
      print(href_link)
      d = soup(requests.get(prefix + href_link).text, 'html.parser')
      time.sleep(5)  # stay under the one-page-per-5-seconds limit
    else:
      print("no more 'next page'")
      break
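
The "follow the next link until there is none" pattern of the while loop above can be tried without touching the network. The sketch below is an illustration under stated assumptions: crawl, fetch, find_next, handle_page and max_pages are hypothetical names, and fake_site is a stand-in dictionary rather than real pages. In a real scraper, fetch would wrap requests.get plus the BeautifulSoup parse, and find_next would wrap the rel="next" lookup.

```python
def crawl(start_url, fetch, find_next, handle_page, max_pages=100):
    """Follow 'next page' links until none is found.

    fetch(url) returns a page, find_next(page) returns the next URL or
    None, and handle_page(page) processes one page. All three are
    hypothetical hooks supplied by the caller.
    """
    visited = []
    url = start_url
    # max_pages is a safety cap in case the site links back to itself.
    while url and len(visited) < max_pages:
        page = fetch(url)
        handle_page(page)
        visited.append(url)
        url = find_next(page)  # None on the last page ends the loop
    return visited


# Offline stand-in for a three-page site: each page names its successor.
fake_site = {
    "/page/1": {"rows": ["a", "b"], "next": "/page/2"},
    "/page/2": {"rows": ["c"], "next": "/page/3"},
    "/page/3": {"rows": ["d"], "next": None},
}
rows = []
visited = crawl(
    "/page/1",
    fetch=fake_site.__getitem__,
    find_next=lambda page: page["next"],
    handle_page=lambda page: rows.extend(page["rows"]),
)
print(visited)  # ['/page/1', '/page/2', '/page/3']
print(rows)     # ['a', 'b', 'c', 'd']
```

Because the loop stops as soon as find_next returns None, the number of pages never has to be hard-coded.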

Something like this should work. It isn't pretty, but hopefully it helps you see how it cycles through the pages.

import requests
from bs4 import BeautifulSoup
from csv import writer
import time

## use the actual base url, since the urls returned from the page are /properties/areas/kelowna-bc/page/XX
base_url = 'https://www.rew.ca'
url = '/properties/areas/kelowna-bc/page/1'

with open("property4.csv", "w") as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(["title", "type", "price", "location", "bedrooms", "bathrooms", "square feet", "link"])
    while url:
        time.sleep(5)  ## not sure how slow to go, but the site will start returning 429 if you scrape too fast.
        response = requests.get(f"{base_url}{url}")
        print(f"{response}, {response.url}")  # debugging -- helps to show which page was actually requested.
        response.raise_for_status()  # this will raise an exception if we did not get a 200 back.
        soup = BeautifulSoup(response.text, "html.parser")
        listings = soup.find_all("article")
        for listing in listings:
            location = listing.find(class_="displaypanel-info").get_text().strip().split()  ## you will need to decide what to do with these
            price = listing.find(class_="displaypanel-title hidden-xs").get_text().strip()
            link = listing.find("a").get('href').strip()
            title = listing.find("a").get('title').strip()
            type = (listing.find(class_="clearfix hidden-xs").find(class_="displaypanel-info")).get_text()
            # not all listings contain bathrooms and square feet
            parts = listing.find_all("li")
            bedrooms = (parts[2]).get_text() if len(parts) >= 3 else None
            bathrooms = (parts[3]).get_text() if len(parts) >= 4 else None
            square_feet = (parts[4]).get_text() if len(parts) >= 5 else None
            csv_writer.writerow([title, type, price, location, bedrooms, bathrooms, square_feet, link])
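            # print(f"{title:  (the rest of this snippet is cut off here)

url = next_btn.find("a")["href"] is a very broad match. Can you confirm that it is not finding a link back to page 1?

You are calling soup.find(class_="paginator…"), but isn't soup the full document? If the button is there, won't it always be returned? It is hard to tell, because when I run your code it crashes trying to access an element that doesn't exist, but that might just be me. I can't seem to find it on the last page.

I think the problem is that next_btn = soup.find(class_="paginator-next_page paginator-control") is still present on the last page even though the button does nothing. Not sure if that makes sense?

@sin triba soup is the full document, but I thought my code was dynamic enough to fetch the document for each page? That may be my mistake, but does your code crash?

When I print url after url = next_btn.find('a')['href'] I get /properties/areas/kelowna-bc/page/2, but /properties/areas/kelowna-bc/ is already part of the base url, so if it crashes, that is the thing to fix. If it is stuck in a loop, then I don't know.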
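
The last comment points out that the href returned by the page already contains the path that is also part of base_url. A minimal sketch of that collision, using the two values quoted in this thread. Note that urllib.parse.urljoin from the standard library is swapped in here as one possible way to combine them; none of the answers above uses it.

```python
from urllib.parse import urljoin

base_url = "https://www.rew.ca/properties/areas/kelowna-bc"
href = "/properties/areas/kelowna-bc/page/2"  # value printed in the comment above

# Plain concatenation repeats the path, which is unlikely to be a valid page.
naive = base_url + href
print(naive)
# https://www.rew.ca/properties/areas/kelowna-bc/properties/areas/kelowna-bc/page/2

# urljoin treats an href starting with "/" as relative to the site root.
joined = urljoin(base_url, href)
print(joined)
# https://www.rew.ca/properties/areas/kelowna-bc/page/2
```

The same result is obtained by concatenating the href onto the bare host, 'https://www.rew.ca', which is what the last two answers do.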