
Python: How can I get product names from a website using Selenium?


I am trying to scrape this page: https://redmart.com/fresh-produce/fresh-vegetables. The problem I am facing is that it only returns some of the elements. The code I am using is below:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.get('https://redmart.com/fresh-produce/fresh-vegetables')

# Wait for the dynamically loaded elements to show up
WebDriverWait(wd, 300).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "productDescriptionAndPrice")))

# And grab the page HTML source
html_page = wd.page_source
wd.quit()

# Now you can use html_page as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'lxml')
print(soup)
I need to use Selenium because the plain page source is no use: the page is generated by JavaScript. If you open the page, it has about 60 rows of products (roughly 360 products in total). Running this code only gives me 6 rows of products; it stops beside the yellow onions.


Thanks!

Here is some Java code. The test waits until more than 30 elements are present:

@Test
public void test1() {
    // Poll for up to 300 seconds: return the product cards once more than 30
    // are present; otherwise scroll to the bottom to trigger lazy loading and
    // let the wait evaluate the condition again.
    List<WebElement> found = new WebDriverWait(driver, 300).until(wd -> {
        List<WebElement> elements = driver.findElements(By.className("productDescriptionAndPrice"));
        if (elements.size() > 30)
            return elements;
        ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.offsetHeight)");
        return null;  // null keeps the wait polling
    });
    for (WebElement e : found) {
        System.out.println(e.getText());
    }
}

As per your question and the website
https://redmart.com/fresh-produce/fresh-vegetables
, Selenium can easily scrape all of the product names on its own. As you mentioned, there are about 360 products in total but only about 35 products from a particular category, so the solution I can offer you is as follows:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    item_names = []
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://redmart.com/fresh-produce/fresh-vegetables")
    titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
    for title in titles:
        item_names.append(title.text)
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
        for title in titles:
            item_names.append(title.text)
    except:
        pass
    for item_name in item_names:
        print(item_name)
    driver.quit()
    
  • Console output:

    Eco Leaf Baby Spinach Fresh Vegetable
    Eco Leaf Kale Fresh Vegetable
    Sustenir Agriculture Almighty Arugula
    Sustenir Fresh Toscano Black Kale
    Sustenir Fresh Kinky Green Curly Kale
    ThyGrace Honey Cherry Tomato
    Australian Broccoli
    Sustenir Agriculture Italian Basil
    GIVVO Japanese Cucumbers
    YUVVO Red Onions
    Australian Cauliflower
    YUVVO Spring Onion
    GIVVO Old Ginger
    GIVVO Cherry Grape Tomatoes
    YUVVO Holland Potato
    ThyGrace Traffic Light Capsicum Bell Peppers
    GIVVO Whole Garlic
    GIVVO Celery
    Eco Leaf Baby Spinach Fresh Vegetable
    Eco Leaf Kale Fresh Vegetable
    Sustenir Agriculture Almighty Arugula
    Sustenir Fresh Toscano Black Kale
    Sustenir Fresh Kinky Green Curly Kale
    ThyGrace Honey Cherry Tomato
    Australian Broccoli
    Sustenir Agriculture Italian Basil
    GIVVO Japanese Cucumbers
    YUVVO Red Onions
    Australian Cauliflower
    YUVVO Spring Onion
    GIVVO Old Ginger
    GIVVO Cherry Grape Tomatoes
    YUVVO Holland Potato
    ThyGrace Traffic Light Capsicum Bell Peppers
    GIVVO Whole Garlic
    GIVVO Celery
    

Note: You can construct a more robust XPath or CSS selector to include more products and extract the relevant product names, as in the sketch below.
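
For illustration, a multi-class XPath could look like the following sketch. Note that 'productPromoDescriptionAndPrice' is a hypothetical placeholder for a second card class; substitute whatever class names you actually observe in the page's DOM.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://redmart.com/fresh-produce/fresh-vegetables")

    # One XPath that matches several card classes. The second class name below
    # is a made-up example -- replace it with the real one from the DOM.
    locator = (By.XPATH, "//div[@class='productDescriptionAndPrice' "
                         "or @class='productPromoDescriptionAndPrice']//h4/a")
    titles = WebDriverWait(driver, 10).until(
        EC.visibility_of_all_elements_located(locator))
    for title in titles:
        print(title.text)
    driver.quit()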

Hi DebanjanB, I appreciate your help. I spent the whole day trying this. The real problem lies in getting the complete product list into the page source; if everything were in the source, I think it could be extracted. I believe the source changes as you scroll down, and perhaps that is why we can only extract 36 items.

With that in mind, my preliminary solution is below. It is not perfect, because I have to do further post-processing to remove duplicates. If you have other ideas or can optimize it further, I would be very grateful.

The general idea is to scroll down, grab the page source each time, and append it all into one big, overlapping source document. Doing this on a page of 360 products gives me more than 1,400 entries, which is why I say it is a poor solution.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.delete_all_cookies()
wd.set_page_load_timeout(30)

wd.get('https://redmart.com/fresh-produce/fresh-vegetables#toggle=all')
time.sleep(5)

html_page = wd.page_source
soup = BeautifulSoup(html_page, 'lxml')

last_height = wd.execute_script("return document.body.scrollHeight")
while True:
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    html_page = wd.page_source
    soup2 = BeautifulSoup(html_page, 'lxml')

    # Append this pass's elements onto the accumulated soup. Materialize the
    # children first, since appending moves nodes out of soup2 as we iterate.
    for element in list(soup2.body.children):
        soup.body.append(element)
    time.sleep(2)

    #break condition
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
wd.quit()

results = soup.findAll('div', attrs={'class': 'productDescriptionAndPrice'})
print(len(results))
print(results[0])   # tallies with the first product on the page
print(results[-1])  # tallies with the last
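
Since the concatenated source contains the same products many times over, a minimal order-preserving de-duplication pass over `results` (as built above) could look like this sketch:

seen = set()
unique_names = []
for result in results:
    # The product name sits in the h4 > a element inside each card.
    link = result.select_one('h4 a')
    name = link.get_text(strip=True) if link else None
    if name and name not in seen:
        seen.add(name)
        unique_names.append(name)
print(len(unique_names))  # should come out near the ~360 products on the page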

Honestly, I am rather disappointed with this solution. Thanks, and please keep the suggestions coming.

Comments:

  • Replace the WebDriverWait(wd, 300).until(...) with a long static sleep. If that works, it means the wait condition was not sufficient.
  • The page generates elements as you scroll down. Adding scrolling to your script will load more items; you may need to wait until the desired number of items has loaded.
  • @JT What exactly is your requirement? Do you want to scrape all 600 products?
  • Thanks for the replies. In the meantime I have kept trying. @DebanjanB yes, I am trying to extract all the products. I tried sleeping, but as KDM mentioned, the items load as I scroll down, so I think I have to add some scrolling to the code. I also tried scrolling manually: when the page came up I added a time.sleep(30), during which I scrolled with the mouse until all 600 products were shown and I had reached the bottom of the page. The code then took over, but this time I only got the last 22 of the 60 rows of products...
  • Thanks KDM. Sorry, I can't run this code as I only have Python. But did you get all ~600 products?
  • @JT I only tried with 30. I guess changing the number to 600 should work. The key is to scroll to the end of the page to load more items.
  • Hi KDM, is this line missing something? List<WebElement> found = new WebDriverWait(driver, 300).until(wd -> {
  • As I mentioned in my answer, I only accommodated a single class. Constructing an XPath that covers multiple classes will give you more results; the logic stays the same.
  • Sorry, which other class? I thought all the products are under the productDescriptionAndPrice class. Please clarify.
  • The class stays the same, but the effective XPath is different, so you need to construct an XPath that covers multiple nodes.
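
For anyone who, like JT, only has Python available: below is a rough Python sketch of KDM's scroll-and-wait idea, not the original answer's code. The target count of 360 comes from the question and is approximate, so it may need tuning; the rest mirrors the Java logic.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
driver.get('https://redmart.com/fresh-produce/fresh-vegetables')

def enough_products(wd):
    # Collect the product cards rendered so far.
    elements = wd.find_elements(By.CLASS_NAME, 'productDescriptionAndPrice')
    if len(elements) >= 360:  # approximate total from the question; tune as needed
        return elements
    # Not enough yet: scroll to the bottom to trigger lazy loading, then retry.
    wd.execute_script("window.scrollTo(0, document.body.offsetHeight)")
    return False

found = WebDriverWait(driver, 300).until(enough_products)
for element in found:
    print(element.text)
driver.quit()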