Web scraping: how do I extract content from classes and add it to a list in chronological order when the classes differ and hold different content?


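I have two situations that need to be handled differently when scraping. Two similar classes both contain the price of a building, and the prices need to be added to Excel in chronological order, because they have to match the other data I am collecting.

The listings I am scraping data from use two different classes. One looks like this: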

<div class="xl-price rangePrice">
                                375.000 €  
                            </div>
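This is what I tried in order to get the prices and add them to a list (the code is reproduced further down).

Output when only one of the two branches is used (here I show the output of the code that runs in the `except` branch):

Each 'Price' marks a new page. But as you can see for page 3, the result is incomplete: it only shows the first kind of value it encounters, which is a single price, and it does not pick up the listings with two price values.

  • When a listing has more than one price, I take the average of those prices and append it to the price list.

Thanks a lot!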

This script gets data from pages 1 to 10 and saves it to a csv file. The price is the average price (if more than one is found in an ad):
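The averaging step described above can be sketched on its own, without any network access. This is a minimal illustration, not the answer's exact code: `parse_price` is a hypothetical helper name, and it assumes prices are written with dots as thousands separators followed by a euro sign, as in the two HTML snippets from the question.

```python
import re
from statistics import mean


def parse_price(text):
    """Turn '375.000 €' or 'from 250.000 € to 695.000 €' into one number."""
    # Each amount is a run of digits and thousands dots followed by a euro sign.
    amounts = re.findall(r"(\d[\d.]*)\s*€", text)
    if not amounts:
        return None  # no price found in this text
    # A single price averages to itself; a from/to range averages both ends.
    return mean(int(a.replace(".", "")) for a in amounts)


print(parse_price("375.000 €"))
print(parse_price("from 250.000 € to 695.000 €"))
```

Because a single price is just the mean of a one-element list, the same function covers both classes and no `try`/`except` branching is needed.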

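Can you share the URL of the page? Or, if you can't, can you edit the post and add the relevant HTML snippets and the expected output?

Done! My bad, I forgot to add the website path I'm extracting from.

@Moofinexplorer I think I helped you before with the same website? Anyway, make your life easier by calling the API directly from here with a POST request.

Does this answer your question?

Haha, did you just write in a few minutes the code I have spent more than a week trying to figure out? It's a punch in the gut and an amazing learning experience; now I can see how you did what I had been trying to do! Thanks a lot! If you don't mind, I'll take a look soon so I can understand some of the things you did :)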
The other class (from the question) looks like this:

<div class="xl-price-promotion rangePrice">
                                <span>from </span> 250.000 € <br><span>to</span> 695.000 €
                            </div>
The code I tried (an excerpt; `header` and `buildinglist1` are not shown):

for number in range(1, 4):
    listplace = (number - 1) * len(buildinglist1)
    immo_page = requests.get(f'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={number}',
                             headers=header)
    soup = BeautifulSoup(immo_page.content, 'lxml')  # html parser

    pricelist = ['Price']

    for item in soup.findAll('div', attrs={'class': 'xl-price'}):
        # item = item.text.strip().split()
        try:
            for item in soup.findAll('div', attrs={'class': 'xl-price-promotion rangePrice'}):
                temp_list = []
                item = item.text.strip().split()
                item.remove('from'), item.remove('€'), item.remove('to'), item.remove('€')
                for price in item: temp_list.append(price.replace('.', ''))
                print(temp_list)
                temp_list = [int(temp_list[0]) + int(temp_list[1])]
                print(temp_list)
                for item in temp_list: pricelist.append(item / 2)
        except ValueError:
            for item in soup.findAll('div', attrs={'class': 'xl-price rangePrice'}):
                item = item.contents[0]
                item = item.strip()[0:-1]
                item = item.replace(' ', '')
                item = item.replace('.', '')
                pricelist.append(item)
    print(pricelist)

The output, one list per page:
['Price', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', 
'499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', 
'210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', 
'689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000', '275000', '298000', '535000', '145000', '159000', '487000', '189000', '325000', '139000', '499000', '520000', '249500', '448000', '215000', '225000', '210000', '215000', '218000', '232000', '689000', '228000', '299500', '135000', '549000', '125000', '169000', '160000', '395000', '430000', '210000']
['Price', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', 
'445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', 
'325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000', '210000', '325000', '375000', '135000', '385000', '285000', '339000', '125000', '225000', '635000', '445000', '689000', '205000', '438000', '595000', '180000', '320000', '48000', '165000', '150000', '119000']
['Price', '235000']
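The repetition in the lists above appears to come from the nested loops in the attempted code: the outer loop runs once per price div, and each time the inner loop scans the whole page again and appends every price it finds. The sketch below reproduces that effect with plain strings standing in for the div elements (the three values are made up for illustration) and compares it with a single loop.

```python
# Stand-ins for the price <div> elements found on one page (hypothetical values).
page_divs = ["235000", "375000", "210000"]

nested = ["Price"]
for _outer in page_divs:       # outer loop: once per price div
    for value in page_divs:    # inner loop: rescans the whole page every time
        nested.append(value)

flat = ["Price"]
for value in page_divs:        # single loop: each div is handled exactly once
    flat.append(value)

# The nested version holds len(page_divs) ** 2 prices, the flat one len(page_divs).
print(len(nested) - 1, len(flat) - 1)
```

With three divs the nested version collects nine prices and the single loop collects three, which matches the pattern of a page's prices repeating once per listing.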
import re
import csv
import requests
from bs4 import BeautifulSoup
from statistics import mean

url = 'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={}'

data = []
for page in range(1, 11):  # pages 1 to 10 inclusive
    soup = BeautifulSoup(requests.get(url.format(page)).text, 'html.parser')

    for result, price, surface, desc, link in zip( soup.select('.title-bar-left'),
                                        soup.select('.rangePrice'),
                                        soup.select('.xl-surface-ch, .l-surface-ch, .m-surface-ch'),
                                        soup.select('.xl-desc, .l-desc, .m-desc'),
                                        soup.select('.result-xl > a[target="IWEB_MAIN"], .result-l > a[target="IWEB_MAIN"], .result-m > a[target="IWEB_MAIN"]') ):
        # raw strings keep the regex backslashes from being read as string escapes
        s = (re.findall(r'\s*(.*?m²)\s*', surface.get_text(strip=True)) or '-')[0]
        bed = (re.findall(r'\s*([\s\d\-]+bed.)\s*', surface.get_text(strip=True)) or '-')[0]

        old_price = price.select_one('.old-price')
        if old_price:
            old_price.extract()

        price = mean( [int(''.join(re.findall(r'\d+', v))) for v in re.findall(r'\s*(.*?)\s*€', price.text)] )

        data.append([result.get_text(strip=True),
              price,
              s, bed, desc.get_text(strip=True)])

        print('{:<65} {:<10} {:<20} {:<20} {:<70}'.format(*data[-1]))

        data[-1] += [link['href']]

with open('output.csv', 'w', newline='') as f_out:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f_out, delimiter=',',
                        quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerows(data)
Apartment                                                         275000     70 m²                2 bed.               energiezuinig app, hartje Leuven, 2 slpk, fietsenstalling             
Apartment                                                         298000     84 m²                2 bed.               App. 2 slpk in de unieke residentie Keizershof!                       
Apartment                                                         535000     80 m²                2 bed.               appartement                                                           
Flat/Studio                                                       145000     32 m²                1 bed.               studio                                                                
Flat/Studio                                                       159000     22 m²                1 bed.               studio                                                                
Apartment                                                         487000     149 m²               3 bed.               Modern spatious apartment within the ring of Leuven                   
Flat/Studio                                                       189000     30 m²                1 bed.               flat                                                                  
Apartment                                                         325000     75 m²                2 bed.               appartement                                                           
Flat/Studio                                                       139000     23 m²                1 bed.               studio                                                                
Apartment                                                         499000     104 m²               2 bed.               appartement                                                           
Apartment                                                         249500     95 m²                2 bed.               appartement                                                           

... and so on.
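The script above gets the prices in page order because both price variants share the `rangePrice` class, so a single selector returns them as they appear in the document. A minimal sketch of just that step follows, run against inline markup modelled on the two snippets in the question; the third div is a made-up extra listing, and `html.parser` is used so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's snippets (third listing is invented).
HTML = """
<div class="xl-price rangePrice"> 375.000 € </div>
<div class="xl-price-promotion rangePrice">
    <span>from </span> 250.000 € <br><span>to</span> 695.000 €
</div>
<div class="xl-price rangePrice"> 235.000 € </div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One selector on the shared class visits both variants in document order,
# so no second pass (and no try/except) is needed to interleave them.
ordered = [div.get_text(" ", strip=True) for div in soup.select("div.rangePrice")]
print(ordered)
```

This should print `['375.000 €', 'from 250.000 € to 695.000 €', '235.000 €']`, with the range listing staying between the two single-price listings.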
import requests
from bs4 import BeautifulSoup
import csv

types = []
sqs = []
prices = []
des = []
links = []

for url in range(1, 11):
    print(f"Extracting Page# {url}")
    r = requests.get(
        f"https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000?page={url}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for ty in soup.findAll('div', attrs={'class': 'title-bar-left'}):
        ty = ty.text.strip()
        types.append(ty)
    for sq in soup.select('div[class*="surface-ch"]'):
        sq = sq.text.strip()
        if 'm²' in sq:
            sq = sq[0:sq.find('m')]
        else:
            sq = 'N/A'
        sqs.append(sq)
    for price in soup.select('div[class*="-price"]'):
        price = price.get_text(strip=True)
        if 'from' in price:
            price = price.replace('from', 'From: ')
            price = price.replace('to', ' To: ')
        else:
            price = price[0:price.find('€') + 1]
        prices.append(price)
    for de in soup.select('div[class*="-desc"]'):
        de = de.get_text(strip=True)
        des.append(de)
    for link in soup.findAll('a'):
        # use a separate name so the page counter `url` is not overwritten
        href = link.get('href')
        if href is not None and 'for-sale/leuven/3000/id' in href:
            links.append(href)
final = []
for item in zip(types, sqs, prices, des, links):
    final.append(item)
with open('output.csv', 'w+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Type', 'Size', 'Price', 'Desc', 'Link'])
    writer.writerows(final)
    print("Operation Completed")
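One caveat when each column is collected into its own list and the lists are combined with `zip`, as in the scripts above: `zip` pairs values purely by position and stops at the shortest list. If one listing lacks a field and nothing is appended for it, later rows shift silently. The script above guards against this for the size column by appending 'N/A'. The sketch below, using made-up lists, shows what happens without such a placeholder and how a simple length check can catch it.

```python
from itertools import zip_longest

# Hypothetical columns scraped independently; the second listing has no size.
types = ["Apartment", "Flat/Studio", "Apartment"]
sizes = ["70", "149"]  # one value missing
prices = ["275000", "145000", "487000"]

# zip() pairs by position and stops at the shortest list, so the third row
# is dropped and "149" lands on the wrong listing.
rows = list(zip(types, sizes, prices))
print(rows)

# A length check makes the mismatch visible before anything is written.
aligned = len({len(col) for col in (types, sizes, prices)}) == 1
print("columns aligned:", aligned)

# zip_longest keeps every row and marks the gap, although it cannot tell
# which listing the missing value actually belonged to.
padded = list(zip_longest(types, sizes, prices, fillvalue="N/A"))
print(padded)
```

Appending a placeholder at the moment a field is found to be missing, as the size loop above does, is what keeps the rows matched to the right listing.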