Python 在网页上刮取多个页面
我正试图从互联网上提取数据。我的代码顺利通过第一个循环,打印数据并将其加载到文件中,但它不会打印下一页的数据。不是,我正在使用Python3笔记本。 这是我的python代码Python 在网页上刮取多个页面,python,web-scraping,beautifulsoup,Python,Web Scraping,Beautifulsoup,我正试图从互联网上提取数据。我的代码顺利通过第一个循环,打印数据并将其加载到文件中,但它不会打印下一页的数据。不是,我正在使用Python3笔记本。 这是我的python代码 import urllib3 from bs4 import BeautifulSoup as soup from time import sleep from random import randint import pandas as pd http = urllib3.
import urllib3
from bs4 import BeautifulSoup as soup
from time import sleep
from random import randint
import pandas as pd
http = urllib3.PoolManager()
filename = "GautengForSale.csv"
f = open(filename, "w")
headers = "Description, Location, Price, Bedrooms, Bathrooms, Parking, FloorSize\n"
f.write(headers)
for page in range(1, 5):
url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
page_html = http.request('GET', url)
page_soup = soup(page_html.data)
containers = page_soup.findAll("div", {"class": "p24_content"})
sleep(randint(2,10))
for container in containers:
description_container = container.findAll("div", {"class": "p24_description"})
if not description_container:
continue
else:
description = description_container[0].text
location_container = container.findAll("span", {"class": "p24_location"})
location = location_container[0].text
price_container = container.findAll("div", {"class": "p24_price"})
price = price_container[0].text.strip()
bedrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
if not bedrooms_container:
bedrooms = 0
else:
bedrooms = bedrooms_container[0].text.strip()
bathrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
if not bathrooms_container:
bathrooms = 1
else:
bathrooms = bathrooms_container[0].text.strip()
parking_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
if not parking_container:
parking = 0
else:
parking = parking_container[0].text.strip()
floor_size_container = container.findAll("span", {"class": "p24_size", "title": "Floor Size"})
if not floor_size_container:
floor_size = "n/a"
else:
floor_size = floor_size_container[0].text.strip()
print(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
f.write(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
f.close()
我不确定哪里出错了。看起来p24\u内容类从第二页开始应用于span标记。解决办法可以是:
containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})
。。。如果我读对了
也许还有更多的事情需要改变。我没有检查:看起来p24\u内容类从第二页开始应用于span标记。解决办法可以是:
containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})
。。。如果我读对了
也许还有更多的事情需要改变。我没有检查:有两个问题:
一,。page_soup.findAlldiv,{class:p24_content}应为page_soup.select.p24_content:,因为该页面随此类而变化,并带有标记
二,。container.findAlldiv,{class:p24_description}应为container.select_one.p24_description,.p24_title,因为类p24_description仅出现在某些页面上
import requests
from bs4 import BeautifulSoup
for page in range(1, 5):
url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
page_soup = BeautifulSoup( requests.get(url).content, 'html.parser' )
for container in page_soup.select(".p24_content"):
description_container = container.select_one(".p24_description, .p24_title")
if not description_container:
continue
else:
description = description_container.get_text(strip=True)
location_container = container.select_one(".p24_location")
location = location_container.get_text(strip=True)
price_container = container.select_one(".p24_price")
price = price_container.text.strip()
bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
if not bedrooms_container:
bedrooms = 0
else:
bedrooms = bedrooms_container.text.strip()
bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
if not bathrooms_container:
bathrooms = 1
else:
bathrooms = bathrooms_container.text.strip()
parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
if not parking_container:
parking = 0
else:
parking = parking_container.text.strip()
floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
if not floor_size_container:
floor_size = "n/a"
else:
floor_size = floor_size_container.text.strip()
print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))
印刷品:
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²
... and so on.
有两个问题:
一,。page_soup.findAlldiv,{class:p24_content}应为page_soup.select.p24_content:,因为该页面随此类而变化,并带有标记
二,。container.findAlldiv,{class:p24_description}应为container.select_one.p24_description,.p24_title,因为类p24_description仅出现在某些页面上
import requests
from bs4 import BeautifulSoup
for page in range(1, 5):
url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
page_soup = BeautifulSoup( requests.get(url).content, 'html.parser' )
for container in page_soup.select(".p24_content"):
description_container = container.select_one(".p24_description, .p24_title")
if not description_container:
continue
else:
description = description_container.get_text(strip=True)
location_container = container.select_one(".p24_location")
location = location_container.get_text(strip=True)
price_container = container.select_one(".p24_price")
price = price_container.text.strip()
bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
if not bedrooms_container:
bedrooms = 0
else:
bedrooms = bedrooms_container.text.strip()
bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
if not bathrooms_container:
bathrooms = 1
else:
bathrooms = bathrooms_container.text.strip()
parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
if not parking_container:
parking = 0
else:
parking = parking_container.text.strip()
floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
if not floor_size_container:
floor_size = "n/a"
else:
floor_size = floor_size_container.text.strip()
print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))
印刷品:
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²
... and so on.
非常感谢。我已经补充了,但问题仍然没有解决。谢谢。我已经补充了这一点,但问题仍然没有解决。