Python: scraping multiple pages within multiple pages (Scrapy)


python web-scraping scrapy


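I'm trying to work out how to structure my code so that it scrapes multiple pages nested within multiple pages. Here is what I mean:

I start from the home page, which has a URL for every letter of the alphabet. Each letter is the first letter of a dog breed's name.

For each letter, there are multiple pages of dog breeds. I need to visit every breed page.

For each breed, there are multiple pages of dogs listed for sale. I need to extract data from every for-sale listing page.

As I said, I'm struggling to understand what the code structure should look like. Part of the problem is that I don't fully understand how the flow of the Python code works. Is something like this correct: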

def parse
    Get URL of all the alphabet letters
    pass on the URL to parse_A

def parse_A
    Get URL of all pages for that alphabet letter
    pass on the URL to parse_B

def parse_B
    Get URL for all breeds listed on that page of that alphabet letter
    pass on the URL to parse_C

def parse_C
    Get URL for all the pages of dogs listed of that specific breed
    pass on the URL to parse_D

def parse_D
    Get URL of specific for sale listing of that dog breed on that page
    pass on the URL to parse_E

def parse_E
    Get all of the details for that specific listing
    Callback to ??

For the final callback in parse_E, do I point it back to parse_D, or to the first parse?

Thanks, everyone!

With Scrapy, you would structure the spider roughly like this:

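import scrapy


class DogListingSpider(scrapy.Spider):
    # The spider name, start URL and every CSS selector below are placeholders;
    # replace them with values that match the real site.
    name = "dog_listings"
    start_urls = ["https://example.com/breeds"]

    def parse(self, response):
        """Get the URLs behind the alphabet letters (breed_urls)."""
        breed_urls = response.css("a.letter::attr(href)").getall()
        for url in breed_urls:
            # urljoin turns a relative href into an absolute URL
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_sub_urls)

    def parse_sub_urls(self, response):
        """Get all sub-URLs from the sub-page (sub_urls), then follow pagination."""
        sub_urls = response.css("a.listing::attr(href)").getall()
        for url in sub_urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details)

        # .get() returns None when there is no next-page link,
        # so a single-page result simply stops here
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_sub_urls)

    def parse_details(self, response):
        """Get the final details from the listing page."""
        details = {}
        details["name"] = response.css("h1::text").get()

        # parse all other details and add them to the dictionary

        yield details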
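
To make the flow concrete, here is a minimal sketch that does not use Scrapy at all. It models the engine as a simple queue: each callback yields either a (url, callback) pair, which gets scheduled, or a dict, which is collected as an item. The SITE dictionary, its URLs, and the crawl helper are all made up for illustration; the real Scrapy engine is asynchronous and far more involved.

```python
from collections import deque

# A fake "site": maps a URL to the links on that page or, for listing pages,
# to the data to extract. Every URL and name here is hypothetical.
SITE = {
    "/": {"links": ["/a", "/b"]},
    "/a": {"links": ["/a/akita"]},
    "/b": {"links": ["/b/beagle", "/b/boxer"]},
    "/a/akita": {"name": "Akita"},
    "/b/beagle": {"name": "Beagle"},
    "/b/boxer": {"name": "Boxer"},
}


def parse(page):
    # Level 1: hand every letter URL to the next callback.
    for url in page["links"]:
        yield (url, parse_letter)


def parse_letter(page):
    # Level 2: hand every breed URL to the final callback.
    for url in page["links"]:
        yield (url, parse_details)


def parse_details(page):
    # Last level: yield the data itself; there is nothing left to call back to.
    yield {"name": page["name"]}


def crawl(start_url):
    """Toy stand-in for the engine: run callbacks until the queue is empty."""
    queue = deque([(start_url, parse)])
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(SITE[url]):
            if isinstance(result, dict):
                items.append(result)  # an item: collect it
            else:
                queue.append(result)  # a (url, callback) pair: schedule it
    return items


if __name__ == "__main__":
    print(crawl("/"))
```

The point of the sketch is that the last callback never needs to point back to an earlier one. The earlier callbacks have already queued every request they wanted, so the final callback only has to yield its data.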


No, you can just "yield" the parsed data there.

@pguardiario Thanks!

Thank you, Arun! I've hit a small snag. Some breeds have only a single page of listings (puppies for sale). I tried to handle this with try/except: the try block looks for the links to additional pages and, if it finds them, moves on to the next parse step; the except block simply calls the next parse step. But in the except block, how do I call the next parse step without passing it a new URL? The only statement I know how to use is yield response.follow, which needs a URL to follow.

Iterating over a list with only one element causes no problems. To handle the next_page case, add code after the yield above that checks whether a next-page URL exists. If it does, request it and parse it; otherwise do nothing. I've updated the parse_sub_urls function accordingly. Take a look.

Thank you so much, Arun! I can't tell you how much this helped me understand the flow :)
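
The single-page case discussed in the comments can be sketched without Scrapy and without try/except. The PAGES dictionary and its URLs below are hypothetical, and the recursion stands in for yielding a new request. In a real spider the equivalent check is on the result of a selector's .get(), which returns None when nothing matches.

```python
# Hypothetical pages: each has listing URLs and, optionally, a next-page link.
PAGES = {
    "/beagle?page=1": {"listings": ["/dog/1", "/dog/2"], "next_page": "/beagle?page=2"},
    "/beagle?page=2": {"listings": ["/dog/3"]},
    "/akita?page=1": {"listings": ["/dog/4"]},  # a breed with only one page
}


def parse_sub_urls(url):
    """Yield every listing on this page, then continue only if a next page exists."""
    page = PAGES[url]
    for listing_url in page["listings"]:
        yield listing_url

    # dict.get returns None when the key is missing, so single-page breeds
    # fall through here without raising and without a special case
    next_page = page.get("next_page")
    if next_page:
        yield from parse_sub_urls(next_page)


if __name__ == "__main__":
    print(list(parse_sub_urls("/beagle?page=1")))
    print(list(parse_sub_urls("/akita?page=1")))
```

Because the listings on the current page are always yielded first, a breed with one page needs no separate code path: the if simply does not fire.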