Web scraping: how do I extract content from classes and add it to a list in chronological order when the classes differ and hold different content?
Tags: web-scraping, beautifulsoup, text-extraction
I have two cases that need to be handled differently when scraping. Two similar classes both contain the price of a building, and the prices have to be added to Excel in chronological order because they must match the other data I am collecting.
The property I am scraping data from has two different classes. One looks like this:
<div class="xl-price rangePrice">
375.000 €
</div>
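This is what I tried in order to get the prices and add them to a list.
The output when only one of the two values is used (in this case, the output of the code that runs in the "except" branch).
Each 'Price' marks the start of a new page. As you can see for page 3, the list is incomplete: it only shows the first value it encounters, which is a single price, and it does not pick up the entries that have two price values.
- When there is more than one price, I take the mean of those prices and append it to the price list.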
Thanks a lot!
This script gets the data from pages 1 to 10 and saves it as a CSV file. The price is the mean price (if more than one price is found):
[Screenshot: the resulting CSV file opened in LibreOffice Calc]
[Link to the output online, followed by a screenshot]
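Can you share the URL of the web page? Or, if you can't, can you edit the post and add the relevant HTML snippet and the expected output?
Done! My bad, I forgot to add the website path I am extracting from.
@Moofinexplorer I think I helped you before with the same website? Anyway, make your life easier by calling the API directly with a POST request.
Does this answer your question?
Haha, did you just write in a few minutes the code I spent more than a week trying to figure out? That is a punch to the gut and an amazing learning experience; now I can see how you did what I had been trying to do! Thank you very much! If you don't mind, I will look through it as soon as I can so that I understand some of the things you did :)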
The other looks like this:
<div class="xl-price-promotion rangePrice">
<span>from </span> 250.000 € <br><span>to</span> 695.000 €
</div>
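Both snippets share the class token rangePrice, so one way to keep the prices in page order is to walk the page once and handle each price block as it appears, whichever variant it is. The following is a minimal sketch under that assumption. It uses only the standard library (html.parser) so that it runs without extra packages; the names SAMPLE_HTML, RangePriceParser and price_from_text are made up for illustration. Averaging a from/to range follows the approach described in the question.

```python
import re
from html.parser import HTMLParser
from statistics import mean

# The two snippets from the question, in the order they might appear on a page.
SAMPLE_HTML = """
<div class="xl-price rangePrice">
375.000 €
</div>
<div class="xl-price-promotion rangePrice">
<span>from </span> 250.000 € <br><span>to</span> 695.000 €
</div>
"""


class RangePriceParser(HTMLParser):
    """Collects the text of every <div> whose class list contains 'rangePrice'."""

    def __init__(self):
        super().__init__()
        self.texts = []    # one string per matching div, in document order
        self._depth = 0    # greater than 0 while inside a matching div
        self._chunks = []  # text pieces of the div currently being read

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self._depth:
            self._depth += 1  # nested div inside a matching div
        elif 'rangePrice' in (dict(attrs).get('class') or '').split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append(''.join(self._chunks))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def price_from_text(text):
    """Return the mean of all euro amounts in text, or None if there are none."""
    amounts = [int(a.replace('.', '')) for a in re.findall(r'(\d[\d.]*)\s*€', text)]
    return mean(amounts) if amounts else None


parser = RangePriceParser()
parser.feed(SAMPLE_HTML)
prices = [price_from_text(t) for t in parser.texts]
print(prices)  # [375000, 472500]
```

Because every price block goes through the same function, a single price and a price range each produce exactly one list entry, so the list keeps one value per listing.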
# Excerpt from the asker's script: requests, BeautifulSoup, header and
# buildinglist1 are defined earlier and are not shown here.
for number in range(1, 4):
    listplace = (number - 1) * len(buildinglist1)
    immo_page = requests.get(f'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={number}',
                             headers=header)
    soup = BeautifulSoup(immo_page.content, 'lxml')  # html parser
    pricelist = ['Price']
    for item in soup.findAll('div', attrs={'class': 'xl-price'}):
        # item = item.text.strip().split()
        try:
            # price range: "from X € to Y €"
            for item in soup.findAll('div', attrs={'class': 'xl-price-promotion rangePrice'}):
                temp_list = []
                item = item.text.strip().split()
                item.remove('from'), item.remove('€'), item.remove('to'), item.remove('€')
                for price in item: temp_list.append(price.replace('.', ''))
                print(temp_list)
                temp_list = [int(temp_list[0]) + int(temp_list[1])]
                print(temp_list)
                for item in temp_list: pricelist.append(item / 2)
        except ValueError:
            # single price: "X €"
            for item in soup.findAll('div', attrs={'class': 'xl-price rangePrice'}):
                item = item.contents[0]
                item = item.strip()[0:-1]
                item = item.replace(' ', '')
                item = item.replace('.', '')
                pricelist.append(item)
    print(pricelist)
['Price', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', 
'499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', 
'210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000']
['Price', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', 
'445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', 
'325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000']
['Price', '235000']
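One possible reading of the repeated values in the lists above, assuming the inner findAll loops sit inside the outer findAll loop, is that the whole page is walked once for every element the outer loop finds. The sketch below reproduces that effect with a plain list standing in for the price elements of one page; page_items, nested and single are illustrative names, and the three values are the first three prices from the output above.

```python
# Stand-in for the price <div> elements found on one page.
page_items = ['275000', '298000', '535000']

# Nested version: the inner loop walks the whole page once per outer item,
# so every price is appended len(page_items) times.
nested = ['Price']
for _outer in page_items:
    for inner in page_items:
        nested.append(inner)

# Single-pass version: each element is visited once, in page order.
single = ['Price']
for item in page_items:
    single.append(item)

print(len(nested) - 1)  # 9 entries: 3 prices repeated 3 times
print(single)           # ['Price', '275000', '298000', '535000']
```

If this reading is right, a single loop over the price elements is enough to get one entry per listing.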
import re
import csv
import requests
from bs4 import BeautifulSoup
from statistics import mean

url = 'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={}'

data = []
for page in range(1, 11):  # pages 1 to 10 inclusive
    soup = BeautifulSoup(requests.get(url.format(page)).text, 'html.parser')

    # each selector returns its elements in page order; zip pairs them up per listing
    for result, price, surface, desc, link in zip(
            soup.select('.title-bar-left'),
            soup.select('.rangePrice'),
            soup.select('.xl-surface-ch, .l-surface-ch, .m-surface-ch'),
            soup.select('.xl-desc, .l-desc, .m-desc'),
            soup.select('.result-xl > a[target="IWEB_MAIN"], .result-l > a[target="IWEB_MAIN"], .result-m > a[target="IWEB_MAIN"]')):
        # surface and bedroom count, or '-' when the text does not contain them
        s = (re.findall(r'\s*(.*?m²)\s*', surface.get_text(strip=True)) or '-')[0]
        bed = (re.findall(r'\s*([\s\d\-]+bed.)\s*', surface.get_text(strip=True)) or '-')[0]

        # remove a crossed-out old price so it is not averaged in
        old_price = price.select_one('.old-price')
        if old_price:
            old_price.extract()

        # mean of every amount that precedes a euro sign (one value, or a from/to pair)
        price = mean([int(''.join(re.findall(r'\d+', v))) for v in re.findall(r'\s*(.*?)\s*€', price.text)])

        data.append([result.get_text(strip=True),
                     price,
                     s, bed, desc.get_text(strip=True)])
        print('{:<65} {:<10} {:<20} {:<20} {:<70}'.format(*data[-1]))
        data[-1] += [link['href']]

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as f_out:
    writer = csv.writer(f_out, delimiter=',',
                        quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerows(data)
Prints:
Apartment 275000 70 m² 2 bed. energiezuinig app, hartje Leuven, 2 slpk, fietsenstalling
Apartment 298000 84 m² 2 bed. App. 2 slpk in de unieke residentie Keizershof!
Apartment 535000 80 m² 2 bed. appartement
Flat/Studio 145000 32 m² 1 bed. studio
Flat/Studio 159000 22 m² 1 bed. studio
Apartment 487000 149 m² 3 bed. Modern spatious apartment within the ring of Leuven
Flat/Studio 189000 30 m² 1 bed. flat
Apartment 325000 75 m² 2 bed. appartement
Flat/Studio 139000 23 m² 1 bed. studio
Apartment 499000 104 m² 2 bed. appartement
Apartment 249500 95 m² 2 bed. appartement
... and so on.
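Because the prices have to line up with the other columns, it can help to keep every listing as one record and write whole records to the CSV file. The sketch below assumes a hypothetical helper rows_to_csv and column names chosen for illustration; the sample values are taken from the printed rows above. It writes to an in-memory buffer so the result can be inspected without touching the disk.

```python
import csv
import io

# Sample records copied from the printed output above.
listings = [
    {'Type': 'Apartment', 'Price': 275000, 'Size': '70 m²', 'Bedrooms': '2 bed.'},
    {'Type': 'Apartment', 'Price': 298000, 'Size': '84 m²', 'Bedrooms': '2 bed.'},
    {'Type': 'Flat/Studio', 'Price': 145000, 'Size': '32 m²', 'Bedrooms': '1 bed.'},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(listings, ['Type', 'Price', 'Size', 'Bedrooms'])
print(csv_text)
```

To write a file instead, the same DictWriter calls can be pointed at a file opened with open('output.csv', 'w', newline='').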
import requests
from bs4 import BeautifulSoup
import csv

types = []
sqs = []
prices = []
des = []
links = []

for page in range(1, 11):
    print(f"Extracting Page# {page}")
    r = requests.get(
        f"https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={page}")
    soup = BeautifulSoup(r.text, 'html.parser')
    # listing type
    for ty in soup.findAll('div', attrs={'class': 'title-bar-left'}):
        ty = ty.text.strip()
        types.append(ty)
    # surface: keep only the number in front of "m²"
    for sq in soup.select('div[class*="surface-ch"]'):
        sq = sq.text.strip()
        if 'm²' in sq:
            sq = sq[0:sq.find('m')]
        else:
            sq = 'N/A'
        sqs.append(sq)
    # price: either a "from ... to ..." range or a single amount
    for price in soup.select('div[class*="-price"]'):
        price = price.get_text(strip=True)
        if 'from' in price:
            price = price.replace('from', 'From: ')
            price = price.replace('to', ' To: ')
        else:
            price = price[0:price.find('€') + 1]
        prices.append(price)
    # description
    for de in soup.select('div[class*="-desc"]'):
        de = de.get_text(strip=True)
        des.append(de)
    # links to the individual listings
    for a in soup.findAll('a'):
        href = a.get('href')
        if href is not None and 'for-sale/leuven/3000/id' in href:
            links.append(href)

# combine the columns row by row (zip stops at the shortest list)
final = []
for item in zip(types, sqs, prices, des, links):
    final.append(item)

with open('output.csv', 'w+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Type', 'Size', 'Price', 'Desc', 'Link'])
    writer.writerows(final)
print("Operation Completed")