How to iterate over multiple pages in Python without knowing the last page


python for-loop web-scraping


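I want to use BeautifulSoup to scrape information and iterate over multiple pages. I know how to do this by writing `for page in range(1, 3)`, for example, if I want the information on the first two pages. However, the information is dynamic and the number of pages will grow. So how do I iterate when I don't know the last page? Currently I have the following code: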

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}

listing_details = []

for page in range(1,3):
    response = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={}&pm=1'.format(page), headers=headers)
    listings = BeautifulSoup(response.content, "lxml")
    
    details = listings.findAll('div', attrs={"data-test":"tile"})
    for detail in details:

        # get property links
        links = detail.findAll('a', href=True)
        for link in links:
            link="https://www.realestate.co.nz" + link['href']

        listing_details.append([link])

df3 = pd.DataFrame(listing_details, columns=['Link'])
print(df3)

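Usually the UI (user interface) pagination box on the site contains some information about what to expect for this particular search query. So you can read the number of the last page from the pagination box, save it in a variable, say `last_page`, and then run `for page in range(1, last_page + 1)` to iterate over the whole range. Note that determining the number of the last page is a task in itself.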
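As a rough illustration of the pagination-box idea, the sketch below picks the largest number out of the link labels found in such a box. The helper name `last_page_from_labels`, the sample labels, and the CSS selector in the comment are assumptions made for illustration; the real markup of the site has to be inspected in the browser first.

```python
def last_page_from_labels(labels):
    """Return the largest page number among pagination labels, or None if there is none."""
    numbers = []
    for label in labels:
        text = label.strip()
        if text.isdecimal():  # skip labels such as 'Prev', 'Next' or '...'
            numbers.append(int(text))
    return max(numbers) if numbers else None


# With BeautifulSoup the labels could be collected roughly like this
# (the selector is hypothetical and depends on the site's HTML):
# labels = [a.get_text(strip=True) for a in soup.select('nav.pagination a')]
labels = ['Prev', '1', '2', '3', '...', '57', 'Next']

last_page = last_page_from_labels(labels)
pages_to_visit = list(range(1, last_page + 1))
print(last_page, len(pages_to_visit))
```

If no numeric label is found the helper returns `None`, which is a signal to fall back to one of the other approaches in this thread.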

[Image: the site's UI pagination box]

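So, in some cases it may be wise to step outside the page range manually (using the browser) and look for something on that page that tells you that you are out of range, for example "Page not found", some other error message, an exception, and so on. You can then loop over the pages until that condition is met.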
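A minimal sketch of that idea follows: keep requesting pages until the response contains an out-of-range marker. The function name `pages_until_marker`, the `max_pages` safety cap, and the fake fetcher are assumptions for illustration. The marker text "Nothing to see here" is taken from the comments in this thread and should be checked against what the site really returns.

```python
def pages_until_marker(fetch_html, marker, max_pages=1000):
    """Yield (page_number, html) for each page until one contains `marker`."""
    for page in range(1, max_pages + 1):
        html = fetch_html(page)
        if marker in html:
            break  # the site says we are past the last page
        yield page, html


# Stand-in for the real site so the sketch runs without network access.
fake_site = {1: '<div>listing A</div>', 2: '<div>listing B</div>'}


def fake_fetch(page):
    return fake_site.get(page, '<p>Nothing to see here</p>')


collected = [page for page, _ in pages_until_marker(fake_fetch, 'Nothing to see here')]
print(collected)
```

In real use, `fetch_html` would wrap something like `requests.get(url.format(page), headers=headers).text`. Checking the page text rather than the status code also covers sites that answer 200 for out-of-range pages.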

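One solution could be to create a while loop that iterates over the pages, adding 1 to the page number on each pass. Break when the page content runs out or the status code is 404.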
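The loop described above might look roughly like the sketch below. The names `crawl_until_missing`, `get_response`, and `has_content` are hypothetical, and the fake responses only mimic the `status_code` and `text` attributes of a `requests` response so that the example runs offline.

```python
from types import SimpleNamespace


def crawl_until_missing(get_response, has_content, max_pages=1000):
    """Visit pages 1, 2, 3, ... and stop at the first 404 or empty page."""
    page = 1
    visited = []
    while page <= max_pages:
        response = get_response(page)
        if response.status_code == 404 or not has_content(response.text):
            break
        visited.append(page)  # parse response.text here in real code
        page += 1
    return visited


def fake_get(page):
    # Pages 1-3 exist; everything after that answers 404.
    if page <= 3:
        return SimpleNamespace(status_code=200, text='<div data-test="tile"></div>')
    return SimpleNamespace(status_code=404, text='')


def has_tiles(text):
    return 'data-test="tile"' in text


visited = crawl_until_missing(fake_get, has_tiles)
print(visited)
```

The content check matters because, as noted in the comments, some sites return status 200 even for pages beyond the last one.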

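You can use a `while True` loop and break when the `next` element is no longer present (or add an end page number of your choosing, whichever comes first).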

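You can also calculate the number of pages from information available in a script tag; the script that reads `totalResults` and `resultsPerPage` also shows the idea of adding a target number of pages.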

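Comments:

Thanks for the idea. Regarding the UI pagination box, I'm not sure I understand. Could you explain more or point me in the right direction? I also looked at an "out of range" page and saw that it says "Nothing to see here". Are you suggesting I write a loop statement?

@Ilovenoodles The site www.realestate.co.nz probably does provide some page navigation so that users can browse the pages manually. So look for the UI (user interface) pagination box, get the biggest page number from it, and iterate over range(1, biggest_page_number + 1).

@Ilovenoodles You can look at what the site serves when you go out of range, so that you can check for it on each iteration. That is for the case where, for some reason, you can't get the biggest page number from the UI.

Cool, thanks. I'll take a look.

Hi, thanks for the suggestion. But somehow I get a 200 status code even when the page is out of range. I guess I need to try some other approach. Thanks for your time and effort.

Thanks for your solution! I found that your first approach works very well and is easy to understand in my case.

You're very welcome.
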
import requests
from bs4 import BeautifulSoup as bs

page = 0

with requests.Session() as s:

    s.headers = {'User-Agent':'Mozilla/5.0'}

    while True:
        page+=1
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1')
        soup = bs(r.content, 'lxml')
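        next_page = soup.select_one('[data-test=next-link]')

        # process the current page first so the last page is not skipped
        print(page)

        if next_page is None:
            # no "next" link on this page, so it was the last one
            break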

import requests, re
from bs4 import BeautifulSoup as bs
import math

target_pages = 3

with requests.Session() as s:
    
    s.headers = {'User-Agent':'Mozilla/5.0'}
    r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page=1&pm=1')
    counts = re.search(r'"totalResults\\":(?P<total>\d+),\\"resultsPerPage\\":(?P<perpage>\d+),', r.text, re.M)
    num_pages = math.ceil(int(counts.group('total'))/int(counts.group('perpage')))
    #print(int(counts.group('total')))
    print(num_pages)
    n = 2
    
    while n <= min(num_pages, target_pages):
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={n}&pm=1')
        print(n)
        n+=1
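
To make the page-count arithmetic above concrete, here is a small offline sketch that applies the same regular expression to a sample string. The sample fragment and its numbers are invented for illustration, and `count_pages` is a hypothetical helper; it also returns `None` when the pattern is not found, a case the script above does not guard against.

```python
import math
import re

# Same pattern as in the script above: the quotes inside the script tag are backslash-escaped.
PATTERN = re.compile(r'"totalResults\\":(?P<total>\d+),\\"resultsPerPage\\":(?P<perpage>\d+),')


def count_pages(page_text):
    """Return the number of result pages, or None if the counts are not found."""
    match = PATTERN.search(page_text)
    if match is None:
        return None
    total = int(match.group('total'))
    per_page = int(match.group('perpage'))
    return math.ceil(total / per_page)  # a partially filled last page still counts


# Invented fragment resembling escaped JSON inside a script tag.
sample = r'{\"totalResults\":1234,\"resultsPerPage\":20,\"page\":1}'

print(count_pages(sample))          # 1234 results at 20 per page
print(count_pages('<html></html>'))
```

With 1234 results and 20 per page, 61 full pages hold 1220 results, so the remaining 14 need one more page.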