Selenium/BeautifulSoup - Python - Loop Through Multiple Pages

python selenium selenium-webdriver web-scraping beautifulsoup

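I have spent most of the day researching and testing the best way to loop through a set of products on a retailer's website.

While I can successfully collect the set of products (and their attributes) on the first page, I am stuck on the best way to loop through the site's pages to continue my scrape.

As the code below shows, I tried using a 'while' loop and Selenium to click the site's 'next page' button and then continue collecting products.

The problem is that my code still doesn't get past page 1.

Am I making a silly mistake? I have read 4 or 5 similar examples on this site, but none was specific enough to provide a solution here.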

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products.clear()
hyperlinks.clear()
reviewCounts.clear()
starRatings.clear()

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1


html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)


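OK, this code won't run when executed on its own from a .py file. I'm guessing you ran it in iPython or a similar environment where these variables were already initialized and the libraries already imported.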

First, you need to include the regex package:

import re
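As a small illustration of what the `re.sub(r"\s+", " ", ...)` call in the scraping code does, here is a minimal sketch. The sample string `raw_name` is made up for the example and is not taken from the retailer's page; it only mimics the kind of whitespace-heavy text a product name block tends to contain.

```python
import re

# Hypothetical sample text, standing in for name.text from a product block.
raw_name = "\n   Men's  Button-Down\n Shirt \n"

# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
collapsed = re.sub(r"\s+", " ", raw_name)

# A single leading and trailing space can remain, so strip() removes them.
cleaned = collapsed.strip()

print(repr(collapsed))
print(repr(cleaned))
```

The leftover leading and trailing space is presumably why a `name.strip()` call is a useful follow-up to the substitution.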

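Also, none of those clear() statements are necessary, since you initialize all of those lists anyway. (In fact, Python will raise a NameError there, because the lists have not been defined yet when clear() is called.)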
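The failure described here can be reproduced in isolation. The following is only a sketch: the function name `clear_before_define` is hypothetical, and the call is wrapped in `try`/`except` so the script keeps running and reports the exception type instead of stopping.

```python
def clear_before_define():
    """Call .clear() on a name that has never been assigned."""
    try:
        products.clear()  # 'products' does not exist yet at this point
    except NameError as exc:
        return type(exc).__name__
    return "no error"

result = clear_before_define()
print(result)
```

In a long-running interactive session the name may already exist from an earlier cell, which would explain why the same lines do not fail there.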

You also need to initialize counterProduct:

counterProduct = 0

Finally, you must assign a value to html_soup before referencing it in the code:

html_soup = BeautifulSoup(driver.page_source, 'html.parser')

Here is the corrected code, which now runs without those errors:
from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
counterProduct = 0
while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    counterProduct +=1
    print(counterProduct)

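Every time you "click" through to the next page, you need to parse it again. So the parsing has to go inside the while loop; otherwise you will keep iterating over the first page even after the browser has moved on to the next one, because the prod_containers object never changes.

Second, the way you have it, your while loop will never stop, because you set pageCounter to 0 but never increment it, so it will always be < your maxPageCount.

I fixed both of these things in the code and ran it, and it appears to have run and parsed pages 1 through 5: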
from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')

products = []
hyperlinks = []
reviewCounts = []
starRatings = []

pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text)+1

prod_containers = html_soup.find_all('li', class_ = 'products_grid')


while (pageCounter < maxPageCount):
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    for product in prod_containers:
        # If the product has review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            name = name.strip()
            products.append(name)

            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)

            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)

            # The product overall star ratings
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating) 

    driver.find_element_by_xpath('//*[@id="page-navigation-top"]/a[2]').click()
    pageCounter +=1
    print(pageCounter)
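The two fixes (re-reading the page inside the loop and incrementing the counter) can be checked without a browser. This is only a sketch under stated assumptions: `FakeDriver`, `click_next`, and `scrape_all` are hypothetical names invented for the example, the HTML is made up, and a plain regular expression stands in for BeautifulSoup so the snippet needs nothing beyond the standard library.

```python
import re

class FakeDriver:
    """Hypothetical stand-in for a Selenium driver holding a list of pages."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        # Like Selenium's driver.page_source: HTML of the current page.
        return self._pages[self._index]

    def click_next(self):
        # Advance to the next page; stay put on the last one.
        if self._index < len(self._pages) - 1:
            self._index += 1

def scrape_all(driver, max_pages):
    names = []
    page_counter = 0
    while page_counter < max_pages:
        # Fix 1: read and parse the current page on every iteration.
        html = driver.page_source
        names.extend(re.findall(r'<li class="products_grid">(.*?)</li>', html))
        driver.click_next()
        # Fix 2: increment the counter so the loop terminates.
        page_counter += 1
    return names

pages = [
    '<ul><li class="products_grid">Shirt A</li>'
    '<li class="products_grid">Shirt B</li></ul>',
    '<ul><li class="products_grid">Shirt C</li></ul>',
]
all_names = scrape_all(FakeDriver(pages), max_pages=len(pages))
print(all_names)
```

Note that with a counter starting at 0, the sketch passes the actual number of pages as the limit. The answer's `maxPageCount` adds 1 to the total, which appears to give one extra iteration on the last page. Also, in Selenium 4.3 and later the `find_element_by_xpath` helper was removed; the current spelling is `driver.find_element(By.XPATH, ...)` with `from selenium.webdriver.common.by import By`.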
