Scraping multiple pages of a website with Python

python web-scraping

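I am trying to extract data from the internet. My code gets through the first loop fine, printing the data and writing it to a file, but it does not print the data for the following pages. Note that I am using a Python 3 notebook. This is my Python code: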

import urllib3
from bs4 import BeautifulSoup as soup
from time import sleep
from random import randint
import pandas as pd
http = urllib3.PoolManager()

filename = "GautengForSale.csv"
f = open(filename, "w")
headers = "Description, Location, Price, Bedrooms, Bathrooms, Parking, FloorSize\n"
f.write(headers)


for page in range(1, 5):
    
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
    page_html = http.request('GET', url)
    page_soup = soup(page_html.data)
    containers = page_soup.findAll("div", {"class": "p24_content"})
    
    sleep(randint(2,10))
    
    for container in containers:
        
        description_container = container.findAll("div", {"class": "p24_description"})
        if not description_container:
            continue
        else:
            description = description_container[0].text
    
        location_container = container.findAll("span", {"class": "p24_location"})
        location = location_container[0].text
   
        price_container = container.findAll("div", {"class": "p24_price"})
        price = price_container[0].text.strip()
        
        bedrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container[0].text.strip()
        
        bathrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container[0].text.strip()
        
        parking_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container[0].text.strip()
        
        floor_size_container = container.findAll("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container[0].text.strip()

        print(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
        f.write(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")

f.close()

I am not sure where I went wrong.

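It looks like the p24_content class is applied to span tags from the second page onward. A fix could be:

containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})

... if I read the page correctly.

There may be more that needs to change; I did not check.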
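To see why the div-only filter can come back empty, here is a minimal offline sketch. The HTML string is made up for illustration (it is not copied from property24.com) and only assumes that the same class sits on a div in one listing and on a span in another:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same class on a <div> in one listing
# and on a <span> in another. Not taken from the real site.
html = """
<div class="p24_content"><span class="p24_location">Fourways</span></div>
<span class="p24_content"><span class="p24_location">Langaville</span></span>
"""

page_soup = BeautifulSoup(html, "html.parser")

# Restricting the search to <div> misses the <span> container.
div_only = page_soup.find_all("div", {"class": "p24_content"})

# Passing a list of tag names matches either tag.
div_or_span = page_soup.find_all(["div", "span"], {"class": "p24_content"})

# A class-only CSS selector ignores the tag name altogether.
by_class = page_soup.select(".p24_content")

print(len(div_only), len(div_or_span), len(by_class))
```

If the real pages behave like this sample, either of the last two lookups should pick up the containers that the original findAll("div", ...) call skips.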

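There are two problems:

1. page_soup.findAll("div", {"class": "p24_content"}) should be page_soup.select(".p24_content"), because the tag that carries this class changes from page to page.

2. container.findAll("div", {"class": "p24_description"}) should be container.select_one(".p24_description, .p24_title"), because the class p24_description only appears on some pages.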

import requests
from bs4 import BeautifulSoup


for page in range(1, 5):
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'

    page_soup = BeautifulSoup( requests.get(url).content, 'html.parser' )

    for container in page_soup.select(".p24_content"):
        description_container = container.select_one(".p24_description, .p24_title")
        if not description_container:
            continue
        else:
            description = description_container.get_text(strip=True)

        location_container = container.select_one(".p24_location")
        location = location_container.get_text(strip=True)

        price_container = container.select_one(".p24_price")
        price = price_container.text.strip()

        bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container.text.strip()

        bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container.text.strip()

        parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container.text.strip()

        floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container.text.strip()

        print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))

Prints:

5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²

... and so on.
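The code above only prints the rows, while the question also writes them to a file. As a side note, joining fields with "," by hand breaks as soon as a field itself contains a comma, so one option is to collect the rows and let the standard csv module handle the quoting. This is only a sketch: write_rows is a hypothetical helper name, the first sample row is copied from the output above, and the second description is invented to show the quoting.

```python
import csv
import io

HEADER = ["Description", "Location", "Price", "Bedrooms",
          "Bathrooms", "Parking", "FloorSize"]


def write_rows(rows, out):
    """Write the header and the listing rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)


rows = [
    # Copied from the printed output above.
    ("2 Bedroom Apartment inWinchester Hills", "Winchester Hills",
     "R 650 000", "2", "1", "1", "69 m²"),
    # Invented description containing a comma, to show that it gets quoted.
    ("House, newly renovated", "Geduld", "R 750 000", 0, 1, 0, "n/a"),
]

# An in-memory buffer stands in for the real file in this sketch.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same helper can be called inside with open("GautengForSale.csv", "w", newline="", encoding="utf-8") as f: followed by write_rows(rows, f), where rows is the list built up inside the page loop.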

Thank you very much. I have added that, but the problem is still not solved.