
Python: Problem scraping a website to capture pagination links

Tags: python, selenium, web-scraping, beautifulsoup, request

I am trying to scrape data from all the category URLs listed on the site's homepage (done) and then from the further sub-category pages and their pagination links. The URL is

I have created a Python script that extracts the data in a modular structure, because I need the output of all the URLs from each step in a separate file. But now I am facing a problem extracting all the pagination URLs, from which the data will later be fetched. Also, I am only getting data from the first sub-category URL, not from all of the listed sub-category URLs.

For example, in my script below, data only comes from:

General practice (main category page) - and further Stethoscopes (sub-category page) -

That is all that comes through. I want the data from all of the sub-category links listed on that page.

Any help would be much appreciated in getting my required output: the product URLs from all the listed sub-category web pages.

The code is below:

import re
import time
import random
import selenium.webdriver.support.ui as ui
from selenium.common.exceptions import TimeoutException, NoSuchElementException 
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from lxml import html  
from bs4 import BeautifulSoup
from datetime import datetime
import csv
import os
from fake_useragent import UserAgent
from fake_useragent.errors import FakeUserAgentError  # needed for the except clause below

# Helper to write data to a file:
def write_to_file(file, mode, data, newline=None, with_tab=None):
    with open(file, mode, encoding='utf-8') as f:
        if with_tab:
            data = ''.join(data)  # flatten an iterable of strings into one string
        if newline:
            data = data + '\n'
        f.write(data)

# Function for data from Module 1:
def send_link(link1):
    browser = webdriver.Chrome()
    browser.get(link1)
    current_page = browser.current_url 
    print (current_page) 
    soup = BeautifulSoup(browser.page_source,"lxml")
    tree = html.fromstring(str(soup))

# Added try and except in order to skip/pass attributes without any value.
    try:
        main_category_url = browser.find_elements_by_xpath("//li[@class=\"univers-group-item\"]/span/a[1][@href]")
        main_category_url = [i.get_attribute("href") for i in main_category_url[4:]]
        print(len(main_category_url))

    except NoSuchElementException:
        main_category_url = ''

    with open('Module_1_OP.tsv', 'a', encoding='utf-8') as outfile:
        for url in main_category_url:
            outfile.write(url + "\n")

# Data Extraction for Categories under HEADERS:
    try:
        sub_category_url = browser.find_elements_by_xpath("//li[@class=\"category-group-item\"]/a[1][@href]")
        sub_category_url = [i.get_attribute("href") for i in sub_category_url[:]]
        print(len(sub_category_url))
    except NoSuchElementException:
        sub_category_url = ''

    with open('Module_1_OP.tsv', 'a', encoding='utf-8') as outfile:
        for url in sub_category_url:
            outfile.write(url + "\n")

    with open("Module_1_OP.tsv") as csvfile:
        csvfilelist = csvfile.readlines()
    send_link2(csvfilelist)

# Function for data from Module 2:
def send_link2(links2): 
    browser = webdriver.Chrome()
    # NOTE: only a test slice of the links (links2[7:10]) is processed here;
    # widen or remove the slice to pick up every sub-category URL.
    start = 7
    end = 10
    for link2 in links2[start:end]:
        print(link2)

        # Pick a random user-agent string (note: it is never actually applied
        # to the Selenium session in this script):
        try:
            ua = UserAgent()
            user_agent = ua.random
        except FakeUserAgentError:
            user_agent = 'Chrome'  # fall back to a fixed value

        proxies = []  # placeholder, currently unused

        # Time the page load, then wait ~10x that long (plus a random delay)
        # before the next request, to go easy on the server:
        t0 = time.time()
        browser.get(link2.strip())  # strip the trailing newline from readlines()
        response_delay = time.time() - t0
        time.sleep(10 * response_delay)
        time.sleep(random.randint(2, 5))
        current_page = browser.current_url 
        print (current_page) 
        soup = BeautifulSoup(browser.page_source,"lxml")
        tree = html.fromstring(str(soup))

        # Added try and except in order to skip/pass attributes without value.
        try:
            product_url = browser.find_elements_by_xpath('//ul[@class=\"category-grouplist\"]/li/a[1][@href]')
            product_url = [i.get_attribute("href") for i in product_url]
            print(len(product_url))
        except NoSuchElementException:
            product_url = ''

        try:
            product_title = browser.find_elements_by_xpath("//ul[@class=\"category-grouplist\"]/li/a[1][@href]")  # find_elements (plural) returns every matching element
            product_title = [i.text for i in product_title[:]]
            print(product_title)
        except NoSuchElementException:
            product_title = ''
        
        with open('Module_1_2_OP.tsv', 'a', encoding='utf-8') as outfile:
            for url, title in zip(product_url, product_title):
                outfile.write(current_page + "\t" + url + "\t" + title + "\n")

        with open('Module_1_2_OP_URL.tsv', 'a', encoding='utf-8') as outfile:
            for url in product_url:
                outfile.write(url + "\n")

        with open("Module_1_2_OP_URL.tsv") as csvfile:
            csvfilelist = csvfile.readlines()
        send_link3(csvfilelist)

# Function for data from Module 3:
def send_link3(csvfilelist): 
    browser = webdriver.Chrome()
    for link3 in csvfilelist[:3]:   # test slice: only the first three links
        print(link3)
        browser.get(link3.strip())  # strip the trailing newline from readlines()
        time.sleep(random.randint(2, 5))
        current_page = browser.current_url 
        print (current_page) 
        soup = BeautifulSoup(browser.page_source,"lxml")
        tree = html.fromstring(str(soup))

        try:
            pagination = browser.find_elements_by_xpath("//div[@class=\"pagination-wrapper\"]/a[@href]")
            pagination = [i.get_attribute("href") for i in pagination]
            print(pagination)

        except NoSuchElementException:
            pagination = ''

        with open('Module_1_2_3_OP.tsv', 'a', encoding='utf-8') as outfile:
            for page_link in pagination:
                outfile.write(current_page + "\n" + page_link + "\n")

        with open("Module_1_2_3_OP.tsv") as dataset:
            dataset_dup = dataset.readlines()
        duplicate(dataset_dup)

# Used to remove duplicate records from a List:
def duplicate(dataset):
    dup_items = set()
    uniq_items = []
    for x in dataset:
        if x not in dup_items:
            uniq_items.append(x)
            dup_items.add(x)
    # Write the de-duplicated, order-preserving list once, after the loop:
    write_to_file('Listing_pagination_links.tsv', 'w', uniq_items, newline=True, with_tab=True)

    with open("Listing_pagination_links.tsv") as csvfile:
        csvfilelist = csvfile.readlines()
    send_link4(csvfilelist)

# Function for data from Module 4:
def send_link4(links3):
    browser = webdriver.Chrome()
    for link3 in links3:
      print(link3)
      # Time the page load and throttle proportionally (plus a random delay):
      t0 = time.time()
      browser.get(link3.strip())  # strip the trailing newline from readlines()
      response_delay = time.time() - t0
      time.sleep(10 * response_delay)
      time.sleep(random.randint(2, 5))
      sub_category_page = browser.current_url 
      print (sub_category_page) 
      soup = BeautifulSoup(browser.page_source,"lxml")
      tree = html.fromstring(str(soup))

      # Added try and except in order to skip/pass attributes without value.
      try:
        product_url1 = browser.find_elements_by_xpath('//div[@class=\"inset-caption price-container\"]/a[1][@href]')
        product_url1 = [i.get_attribute("href") for i in product_url1]
        print(len(product_url1))
      except NoSuchElementException:
        product_url1 = ''

      with open('Final_Output_' + datestring + '.tsv', 'a', encoding='utf-8') as outfile:
          for url in product_url1:
              outfile.write(sub_category_page + "\t" + url + "\n")

# PROGRAM STARTS EXECUTING FROM HERE...
# Added to attach Real Date and Time field to Output filename
datestring = datetime.strftime(datetime.now(), '%Y-%m-%d-%H-%M-%S') # For filename
#datestring2 = datetime.strftime(datetime.now(), '%H-%M-%S') # For each record

send_link("http://www.medicalexpo.com/")

You actually don't need Selenium at all. The code below will fetch the category, sub-category, and item links, names, and descriptions for all the content on the site.

The only tricky part is the while loop that handles the pagination. The principle is that as long as there is a "next" button on the site, there is more content to load. In this case the site actually hands us the "next" link in a tag, so it is easy to iterate until there are no more next links left to retrieve.

Bear in mind that this may take a while to run. Also bear in mind that you should probably insert a sleep (of, say, 1 second) between the requests in the while loop, to treat the server more gently.

Doing so lowers the risk of getting yourself banned or the like.

import requests
from bs4 import BeautifulSoup
from time import sleep

items_list = [] # list of dictionaries with this content: category, sub_category, item_description, item_name, item_link 

r = requests.get("http://www.medicalexpo.com/")
soup = BeautifulSoup(r.text, "lxml")
cat_items = soup.find_all('li', class_="category-group-item")
cat_items = [[cat_item.get_text().strip(),cat_item.a.get('href')] for cat_item in cat_items]

# cat_items is now a list with elements like this:
# ['General practice','http://www.medicalexpo.com/cat/general-practice-K.html']
# to access the next level, we loop:

for category, category_link in cat_items[:1]:  # [:1] limits the demo to the first category
    print("[*] Extracting data for category: {}".format(category))

    r = requests.get("http://www.medicalexpo.com/cat/general-practice-K.html")
    soup = BeautifulSoup(r.text, "lxml")
    # data of all sub_categories are located in an element with the id 'category-group'
    cat_group = soup.find('div', attrs={'id': 'category-group'})

    # the data lie in 'li'-tags
    li_elements = cat_group.find_all('li')
    sub_links = [[li.a.get('href'), li.get_text().strip()] for li in li_elements]

    # sub_links is now a list of elements like this:
    # ['http://www.medicalexpo.com/medical-manufacturer/stethoscope-2.html', 'Stethoscopes']

    # to access the last level we need to dig further in with a loop
    for sub_category_link, sub_category in sub_links:
        print("  [-] Extracting data for sub_category: {}".format(sub_category))
        local_count = 0
        load_page = True
        item_url = sub_category_link
        while load_page:
            print("     [-] Extracting data for item_url: {}".format(item_url))
            r = requests.get(item_url)
            soup = BeautifulSoup(r.text, "lxml")
            item_links = soup.find_all('div', class_="inset-caption price-container")[2:]
            for item in item_links:
                item_name = item.a.get_text().strip().split('\n')[0]
                item_link = item.a.get('href')
                try:
                    item_description = item.a.get_text().strip().split('\n')[1]
                except IndexError:  # no second line means no description
                    item_description = None
                item_dict = {
                    "category": category,
                    "subcategory": sub_category,
                    "item_name": item_name,
                    "item_link": item_link,
                    "item_description": item_description
                }
                items_list.append(item_dict)
                local_count +=1
            # all item pages have a pagination element;
            # if there are more pages to load, it will contain an element with a "next" class;
            # on the last page there is no "next" element and next_link ends up as None
            pagination = soup.find(class_="pagination-wrapper")
            try:
                next_link = pagination.find(class_="next").get('href', None)
            except AttributeError:  # no "next" element on the last page
                next_link = None
            # consider inserting a sleep(1) right about here...
            # if next_link exists, there are more pages to load;
            # we then set item_url = next_link and the while loop continues
            if next_link is not None:
                item_url = next_link
            else:
                load_page = False
        print("      [-] a total of {} item_links extracted for this sub_category".format(local_count))

# this will yield a list of dicts like this one:

# {'category': 'General practice',
#  'item_description': 'Flac duo',
#  'item_link': 'http://www.medicalexpo.com/prod/boso-bosch-sohn/product-67891-821119.html',
#  'item_name': 'single-head stethoscope',
#  'subcategory': 'Stethoscopes'}

# If you need to export to something like Excel, use pandas: create a DataFrame and simply load it with the list.
# pandas can then export the data to Excel easily...
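
As a minimal sketch of that pandas export (assuming pandas and openpyxl are installed; the output filename is illustrative):

import pandas as pd

# Build a DataFrame directly from the list of dicts collected above;
# each dict key ("category", "item_name", ...) becomes a column.
df = pd.DataFrame(items_list)

# Write an .xlsx file; pandas uses the openpyxl engine for this format.
df.to_excel("medicalexpo_items.xlsx", index=False)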

Comments:

- I think you would be better (and faster) off reverse engineering the site. Look at what happens in the Network tab when you use the search field. JSON comes back in return...
- Thanks a lot, Jlaur, for your comments and feedback. I added random sleep times and it worked out. I just need to extract the output in Excel, as referenced in your comment in the code. Thanks!
- Hi Jlaur, I am trying to export the data into Excel using pandas but cannot manage it. Could you help with this, as I am new to Python pandas?
- What error are you getting? Maybe you need a module that does not ship with pandas (openpyxl). If that is the case, just run python -m pip install openpyxl and you should be fine...
- Yes, I got the data in Excel. Thanks for your help. - Sure. Happy to help.
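
For what it's worth, a rough sketch of the reverse-engineering approach suggested in the first comment. The endpoint URL and the "q" parameter below are placeholders, not a real medicalexpo.com API; you would substitute the actual request URL observed in the browser's Network tab:

import requests

# Hypothetical endpoint: copy the real XHR URL from the Network tab instead.
SEARCH_ENDPOINT = "https://example.com/search"

response = requests.get(SEARCH_ENDPOINT, params={"q": "stethoscope"})
response.raise_for_status()

# If the site answers with JSON, parse it directly and skip HTML parsing entirely.
results = response.json()
print(results)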