Python 如何在硒中正确刮取物品？_Python_Selenium_Selenium Webdriver_Web Scraping

Python 如何在硒中正确刮取物品？

python selenium selenium-webdriver web-scraping

Python 如何在硒中正确刮取物品？,python,selenium,selenium-webdriver,web-scraping,Python,Selenium,Selenium Webdriver,Web Scraping,我目前正在做一个网页抓取项目，包括从Delhaize网站抓取产品、价格和可能的折扣。使用我的代码，我得到了正确数量的产品，但是有一些产品没有价格和折扣，为了应对这种情况，我尝试逐个产品，试图找到正确数量的产品价格。然而，我从来没有得到正确的数量，要么太多，要么太少你能帮我吗？我的代码如下： import pandas as pd from selenium import webdriver from selenium.webdriver.chrome.options import Option

我目前正在做一个网页抓取项目，包括从Delhaize网站抓取产品、价格和可能的折扣。使用我的代码，我得到了正确数量的产品，但是有一些产品没有价格和折扣，为了应对这种情况，我尝试逐个产品，试图找到正确数量的产品价格。然而，我从来没有得到正确的数量，要么太多，要么太少

你能帮我吗？我的代码如下：

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from datetime import datetime
import time


myProxy = {
            "http"  : "http://10.120.118.49:8080",
            "https"  : "https://10.120.118.49:8080"
            }

headers={'User-agent' : 'Mozilla/5.0'}


Product=[]
Price=[]
Discount=[]

chrome_options = Options()

chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--proxy-server=http://10.120.118.49:8080")
chrome_options.add_argument("--headless")


driver = webdriver.Chrome(executable_path='C:/Users/C71220/chromedriver.exe', options=chrome_options)

for u in range(0,6): 

    url='https://www.delhaize.be/nl-be/shop/Dranken-en-alcohol/c/v2DRI?q=:relevance:manufacturerNameFacet:Coca-Cola:manufacturerNameFacet:Schweppes:manufacturerNameFacet:Fanta:manufacturerNameFacet:Chaudfontaine&sort=relevance&pageNumber=' + str(u)    
    driver.get(url)

    try:
        # makes the scraper wait until the element is loaded on the website
        WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'data-item')))

        for products in driver.find_elements_by_xpath("//div[@class='description anchor--no-style']"):
            Product.append(products.text.strip('\n'))

        product=driver.find_elements_by_xpath("//div[@class='layout-basket-area']")
        for i in product:

            prices=i.find_elements_by_xpath("//span[@class='quantity-price super-bold']")
            for a in prices:
                if a is not None:
                    Price.append(a.text)
                else:
                    Price.append('')


            promotions=i.find_element_by_xpath("//div[@class='PromotionStickerWrapper']")
            if promotions is not None:
                Discount.append(promotions)
            else:
                Discount.append(promotions)



        print('Scraping...')

    except (NoSuchElementException, TimeoutException):
        pass

print(Product, Price, Discount)
print(len(Product))
print(len(Price))
print(len(Discount))

<div class="layout-basket-area"...<div>
   <span class="quantity-price super-bold">

编辑：

价格的HTML代码如下所示：

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from datetime import datetime
import time


myProxy = {
            "http"  : "http://10.120.118.49:8080",
            "https"  : "https://10.120.118.49:8080"
            }

headers={'User-agent' : 'Mozilla/5.0'}


Product=[]
Price=[]
Discount=[]

chrome_options = Options()

chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--proxy-server=http://10.120.118.49:8080")
chrome_options.add_argument("--headless")


driver = webdriver.Chrome(executable_path='C:/Users/C71220/chromedriver.exe', options=chrome_options)

for u in range(0,6): 

    url='https://www.delhaize.be/nl-be/shop/Dranken-en-alcohol/c/v2DRI?q=:relevance:manufacturerNameFacet:Coca-Cola:manufacturerNameFacet:Schweppes:manufacturerNameFacet:Fanta:manufacturerNameFacet:Chaudfontaine&sort=relevance&pageNumber=' + str(u)    
    driver.get(url)

    try:
        # makes the scraper wait until the element is loaded on the website
        WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'data-item')))

        for products in driver.find_elements_by_xpath("//div[@class='description anchor--no-style']"):
            Product.append(products.text.strip('\n'))

        product=driver.find_elements_by_xpath("//div[@class='layout-basket-area']")
        for i in product:

            prices=i.find_elements_by_xpath("//span[@class='quantity-price super-bold']")
            for a in prices:
                if a is not None:
                    Price.append(a.text)
                else:
                    Price.append('')


            promotions=i.find_element_by_xpath("//div[@class='PromotionStickerWrapper']")
            if promotions is not None:
                Discount.append(promotions)
            else:
                Discount.append(promotions)



        print('Scraping...')

    except (NoSuchElementException, TimeoutException):
        pass

print(Product, Price, Discount)
print(len(Product))
print(len(Price))
print(len(Discount))

<div class="layout-basket-area"...<div>
   <span class="quantity-price super-bold">

代码中的错误太多，无法修复。我重写了一些部分并添加了评论。试试这个：
for u in range(0,6):
    url='https://www.delhaize.be/nl-be/shop/Dranken-en-alcohol/c/v2DRI?q=:relevance:manufacturerNameFacet:Coca-Cola:manufacturerNameFacet:Schweppes:manufacturerNameFacet:Fanta:manufacturerNameFacet:Chaudfontaine&sort=relevance&pageNumber=' + str(u)
    driver.get(url)

    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, 'data-item')))

    for product in driver.find_elements_by_class_name("data-item"):
        # get the product list item by class name
        product_name = product.find_element_by_class_name("ProductHeader").text.replace("\n", " - ")
        # try to get the price span by class name with the product list item html else set it to zero
        try:
            product_price = product.find_element_by_class_name("quantity-price").text
            # clean the price by replace € and , and convert it to float
            float_product_price = float(product_price.replace("€","").replace(",","."))
        except NoSuchElementException:
            product_price = "0"
            float_product_price = 0
        # try to get the discount span by class name with the product list item html else set it to zero
        try:
            product_discount = product.find_element_by_class_name("multiLinePromotion").text
            # clean the discount by replace -  %  € and , and convert it to float
            float_product_discount = float (product_discount.replace("- ","").replace("%","").replace("€","").replace(",","."))
        except NoSuchElementException:
            product_discount ="0"
            float_product_discount = 0

        Product.append(product_name)
        Price.append(float_product_price)
        Discount.append(float_product_discount)



print(Product, Price, Discount)
print(len(Product))
print(len(Price))
print(len(Discount))

请发布您正在处理的HTML的全部或至少一个有代表性的片段。否则我们无法帮助您。我添加了页面的html格式。是否有错误？什么是消息和堆栈跟踪？其中一个答案声称代码中有太多错误需要修复，这准确吗？另一方面，变量和函数名应遵循带有下划线的小写形式。