How to extract data from a drop-down menu using Python BeautifulSoup

Tags: python, web-scraping, drop-down-menu, beautifulsoup

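I am trying to scrape data from a website that has a multi-level drop-down menu: each time an item is selected, the sub-items in the child drop-down change. The problem is that on every loop iteration it extracts the same sub-items from the drop-down. The selection happens, but the items are not updated to reflect the new selection made in the loop. Can anyone explain why I am not getting the result I want? Maybe it is because the drop-downs are built with JavaScript or something similar.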

For example, as in the picture below:

This is how far I have got:


from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
import csv

# from selenium.webdriver.support import Select
import time

print ("opening chrome....")
driver = webdriver.Chrome()
driver.get('https://www.wheelmax.com/')
time.sleep(10)

csvData = ['Year', 'Make', 'Model', 'Body', 'Submodel', 'Size']

# variables
yeart = []
make= []
model=[]
body = []
submodel = []
size = []
Yindex = Mkindex = Mdindex = Bdindex = Smindex = Sindex = 0

print ("waiting for program to set variables....")
time.sleep(20)

print ("initializing and setting variables....")

# initializing Year
Year = Select(driver.find_element_by_id("icm-years-select"))
Year.select_by_value('2020')
yr = driver.find_elements(By.XPATH, '//*[@id="icm-years-select"]')
time.sleep(15)

# initializing Make
Make = Select(driver.find_element_by_id("icm-makes-select"))
Make.select_by_index(1)
mk = driver.find_elements(By.XPATH, '//*[@id="icm-makes-select"]')
time.sleep(15)

# initializing Model
Model = Select(driver.find_element_by_id("icm-models-select"))
Model.select_by_index(1)
mdl = driver.find_elements(By.XPATH, '//*[@id="icm-models-select"]')
time.sleep(15)

# initializing body
Body = Select(driver.find_element_by_id("icm-drivebodies-select"))
Body.select_by_index(1)
bdy = driver.find_elements(By.XPATH, '//*[@id="icm-drivebodies-select"]')
time.sleep(15)

# initializing submodel
Submodel = Select(driver.find_element_by_id("icm-submodels-select"))
Submodel.select_by_index(1)
sbm = driver.find_elements(By.XPATH, '//*[@id="icm-submodels-select"]')
time.sleep(15)

# initializing size
Size = Select(driver.find_element_by_id("icm-sizes-select"))
Size.select_by_index(0)
siz = driver.find_elements(By.XPATH, '//*[@id="icm-sizes-select"]')
time.sleep(5)


Cyr = Cmk = Cmd = Cbd = Csmd = Csz = ""

print ("fetching data from variables....")

for y in yr:
    obj1 = driver.find_element_by_id("icm-years-select")
    Year = Select(obj1)
    Year.select_by_index(++Yindex)
    obj1.click()
    #obj1.click()
    yeart.append(y.text)
    Cyr = y.text
    time.sleep(10)
    for m in mk:
        obj2 = driver.find_element_by_id("icm-makes-select")
        Make = Select(obj2)
        Make.select_by_index(++Mkindex)
        obj2.click()
        #obj2.click()
        make.append(m.text)
        Cmk = m.text
        time.sleep(10)
        for md in mdl:
            Mdindex =0
            obj3 = driver.find_element_by_id("icm-models-select")
            Model = Select(obj3)
            Model.select_by_index(++Mdindex)
            obj3.click()
            #obj3.click(clickobj)
            model.append(md.text)
            Cmd = md.text
            time.sleep(10)
            Bdindex = 0
            for bd in bdy:
                obj4 = driver.find_element_by_id("icm-drivebodies-select")
                Body = Select(obj4)
                Body.select_by_index(++Bdindex)
                obj4.click()
                #obj4.click(clickobj2)
                body.append(bd.text)
                Cbd = bd.text
                time.sleep(10)
                Smindex = 0
                for sm in sbm:
                    obj5 = driver.find_element_by_id("icm-submodels-select")
                    Submodel = Select(obj5)
                    obj5.click()
                    Submodel.select_by_index(++Smindex)
                    #obj5.click(clickobj5)
                    submodel.append(sm.text)
                    Csmd = sm.text
                    time.sleep(10)
                    Sindex = 0
                    for sz in siz:
                        Size = Select(driver.find_element_by_id("icm-sizes-select"))
                        Size.select_by_index(++Sindex)
                        size.append(sz.text)
                        Scz = sz.text
                        csvData += [Cyr, Cmk, Cmd, Cbd,Csmd, Csz]

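My guess is that you cannot parse the years with Beautiful Soup because the `select` tag that holds the `option` tags for all the years is not yet present (or is hidden) when Beautiful Soup downloads the page. It is added to the DOM later by additional JavaScript. If you inspect the DOM of the loaded page with your browser's developer tools (for example F12 in Mozilla), you will see the tag that contains the information you are looking for.

https://www.wheelmax.com has multi-level drop-down menus that depend on each other. For example, once you pick an option in the Select Year drop-down, the Select Make drop-down is enabled and shows options based on the selected year.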
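
To illustrate the point about options being added by JavaScript, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra installs, and both HTML snippets are assumed examples, not markup copied from the site. Only the id `icm-years-select` comes from the question.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the text of <option> tags inside the <select> with a given id."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.in_select = False
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "select" and dict(attrs).get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_option:
            self.options[-1] += data.strip()


def option_texts(html, select_id):
    """Return the option labels found in `html` for the given <select> id."""
    parser = OptionCollector(select_id)
    parser.feed(html)
    return parser.options


# Assumed markup, for illustration only: what a server might send before
# any JavaScript has run, and what the DOM might look like afterwards.
STATIC_HTML = '<select id="icm-years-select"><option>Select Year</option></select>'
RENDERED_HTML = (
    '<select id="icm-years-select"><option>Select Year</option>'
    '<option value="2020">2020</option><option value="2019">2019</option></select>'
)

print(option_texts(STATIC_HTML, "icm-years-select"))
print(option_texts(RENDERED_HTML, "icm-years-select"))
```

A parser that only sees the downloaded HTML finds just the placeholder entry. The real options exist only after the browser has run the page's scripts.
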

So basically you need to use the Selenium package to handle the dynamic options.

Install the Selenium web driver that matches your browser.

Download the Chrome web driver, then install it:

unzip ~/Downloads/chromedriver_linux64.zip -d ~/Downloads
chmod +x ~/Downloads/chromedriver
sudo mv -f ~/Downloads/chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

For example, selecting options in several drop-downs with Selenium:

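from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome()
driver.get('https://www.wheelmax.com/')
time.sleep(4)  # crude wait for the page's JavaScript to fill the year list

# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4.3 removed
selectYear = Select(driver.find_element(By.ID, "icm-years-select"))
selectYear.select_by_value('2019')

time.sleep(2)  # give the make drop-down time to load the makes for 2019

selectMakes = Select(driver.find_element(By.ID, "icm-makes-select"))
selectMakes.select_by_value('58')  # '58' is the option's value attribute, not its label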
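
The fixed `time.sleep` calls above either waste time or turn out too short on a slow connection. A possible alternative is to poll for a condition. The helper below is a generic sketch written for this note; `wait_until` and its parameters are hypothetical names, not part of Selenium. Selenium ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which is built on the same idea.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in condition so the helper can be tried without a browser:
# it becomes true on the third check.
checks = []


def ready():
    checks.append(1)
    return len(checks) >= 3


print(wait_until(ready, timeout=2, poll=0.01))
```

With a live page, the predicate could re-find the make drop-down and check that it holds more than its placeholder option, in place of the second `time.sleep`.
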
Update:

To list a drop-down's option values, or to count its options:

for option in selectYear.options:
    print(option.text)

print(len(selectYear.options))
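
Two details in the question's loops would explain why the same items come back every time. First, `yr`, `mk`, `mdl` and the other lists are fetched once, before the loops start, and the XPath `//*[@id="..."]` matches the `select` element itself rather than its options. Second, Python has no `++` operator: `++Yindex` is two unary plus signs and leaves `Yindex` unchanged. The sketch below re-reads the options after every selection. The names `walk` and `selenium_callbacks`, the `placeholder_count` parameter, and the sample data are all hypothetical, and the Selenium wiring is untested against the live site.

```python
def walk(levels, read_options, choose, path=()):
    """Yield every complete path through a chain of dependent drop-downs.

    levels       -- drop-down ids, parent first
    read_options -- callable(level) -> option texts available right now
    choose       -- callable(level, text) that makes a selection
    """
    if not levels:
        yield path
        return
    level, rest = levels[0], levels[1:]
    # Re-read the options on every visit: the parent selection changed them.
    for text in list(read_options(level)):
        choose(level, text)
        yield from walk(rest, read_options, choose, path + (text,))


def selenium_callbacks(driver, placeholder_count=1):
    """Build read_options/choose for a live page (imports Selenium lazily).

    Assumes each drop-down starts with `placeholder_count` non-data entries.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    def current(level):
        # Look the element up again each time; the page may re-render it.
        return Select(driver.find_element(By.ID, level))

    def read_options(level):
        # Wait until the drop-down holds more than its placeholder entries.
        WebDriverWait(driver, 15).until(
            lambda d: len(current(level).options) > placeholder_count
        )
        return [o.text for o in current(level).options][placeholder_count:]

    def choose(level, text):
        current(level).select_by_visible_text(text)

    return read_options, choose


# In-memory stand-in for the page (made-up data) so the walk can be tried
# without a browser: the models offered depend on the make selected.
MODELS = {"Acura": ["ILX", "MDX"], "Audi": ["A3"]}
state = {}


def fake_read(level):
    return list(MODELS) if level == "make" else MODELS[state["make"]]


def fake_choose(level, text):
    state[level] = text


rows = list(walk(["make", "model"], fake_read, fake_choose))
print(rows)
```

Against the real page one would pass the ids from the question (`icm-years-select`, `icm-makes-select`, and so on) together with `selenium_callbacks(driver)`, and write each yielded tuple as one row with `csv.writer`. The wait in `read_options` only checks that some options exist, so a child list that has not yet refreshed after a new parent selection could still slip through. A stricter condition may be needed.
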


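The page makes a callback request to populate the years. Simply mimic that request.

If you actually need to change the year and then select from the dependent drop-downs, that becomes a different question. You would need browser automation such as Selenium, or you could make the selection by hand while watching the browser's network tab to see whether an XHR request is sent that you can mimic to submit the selection.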

import requests

r = requests.get('https://www.iconfigurators.com/json2/?returnType=json&bypass=true&id=13898&callback=yearObj').json()
years = [item['year'] for item in r['years']]
print(years)
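
The `callback=yearObj` parameter in that URL is the usual sign of a JSONP endpoint. If a response of this kind comes back wrapped in a function call rather than as bare JSON, `.json()` raises an error. The helper below is a sketch that accepts either form. `parse_json_or_jsonp` is a hypothetical name, and the sample body is made up to match the `years` / `year` keys used above, not captured from the site.

```python
import json
import re


def parse_json_or_jsonp(text):
    """Parse a response body that is either plain JSON or JSONP.

    JSONP wraps the payload in a function call, e.g. yearObj({...});
    """
    text = text.strip()
    # A leading identifier followed by (...) and an optional semicolon.
    match = re.fullmatch(r"[\w$.]+\s*\((.*)\)\s*;?", text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample body, shaped like the 'years' structure used above.
sample = 'yearObj({"years": [{"year": "2020"}, {"year": "2019"}]});'
data = parse_json_or_jsonp(sample)
print([item["year"] for item in data["years"]])
```

With `requests`, the input would be `requests.get(url).text` instead of the sample string.
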

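Can you suggest how I should handle the dependent drop-downs?
Do you mean retrieving data based on a selection, or just retrieving the mapping between the drop-downs?
Any links you would suggest? I am trying to learn web scraping on my own.
I mean retrieving data based on the selection. I need it for my own website.
Be careful if you are taking content from a website that is not your own. See the middle part of my answer regarding selection. I would suggest researching that with existing questions on SO and then posting a new question.
As for links, I know there are lots of Python web scraping tutorials. I tried a few but did not find one for Python that I really liked. You can find most of what you need on StackOverflow and YouTube.
Are you sure Selenium can solve this? I need the complete data extraction.
@GeekOnline Yes, Selenium can solve your problem. Let me add some code to show you how to use it.
I have one question though! How do I select or count all the items in a particular drop-down? Most of the answers are for Java!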