Web scraping: scraping multiple pages
I have a function that scrapes an individual web page. How can I scrape multiple pages after following the corresponding links? Do I need a separate function that calls parse(), like gotoIndivPage() below? Thanks, everyone!
import scrapy

class trainingScraper(scrapy.Spider):
    name = "..."
    # start_urls must be a list, even with a single URL
    start_urls = ["url with links to multiple pages"]

    # for scraping an individual page
    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="location"]/ul/li/a/text()'
        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }

    def gotoIndivPage(self, response):
        PAGE_SELECTOR = '//h3[@class="entry-title"]/a/@href'
        # .extract() turns the selectors into href strings
        for page in response.xpath(PAGE_SELECTOR).extract():
            if page:
                yield scrapy.Request(
                    response.urljoin(page),
                    callback=self.parse
                )
Generally, I create a new function for each different type of HTML structure I'm scraping. So if your link sends you to a page with a different HTML structure than the starting page, I would create a new function and pass it as the callback:
def parseNextPage(self, response):
    # Parse the new page's structure here
    ...

def parse(self, response):
    SELECTOR1 = '.entry-title ::text'
    SELECTOR2 = '//li[@class="example"]/ul/li/a/text()'
    yield {
        'title': response.css(SELECTOR1).extract_first(),
        'date': response.xpath(SELECTOR2).extract_first(),
    }
    # The XPath must be a quoted string, and the relative href
    # should be resolved against the current page's URL
    href = response.xpath('//li[@class="location"]/ul/li/a/@href').extract_first()
    yield scrapy.Request(
        url=response.urljoin(href),
        callback=self.parseNextPage
    )
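Since the Request/callback chain is the part that usually trips people up, here is a minimal sketch of the same pattern using hypothetical stand-in Request and Response classes instead of Scrapy itself (no network, no Scrapy install). It only illustrates how the engine hands each yielded Request's response to its callback, which is exactly what happens when parse() yields a Request pointing at parseNextPage:

```python
# Stand-ins for scrapy.Request / Response, purely to trace the flow.
class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class Response:
    def __init__(self, url):
        self.url = url

def parse_index(response):
    """Like parse() on the listing page: yield one Request per detail link."""
    for href in ["/item/1", "/item/2"]:  # imagine these came from an XPath
        yield Request(response.url + href, callback=parse_item)

def parse_item(response):
    """Like the detail-page callback: yield a scraped item (a dict)."""
    yield {"title": "title of " + response.url}

def run(start_url):
    """Tiny stand-in for Scrapy's engine loop: follow Requests, collect items."""
    items, queue = [], [Request(start_url, callback=parse_index)]
    while queue:
        req = queue.pop(0)
        for result in req.callback(Response(req.url)):  # fake "download"
            if isinstance(result, Request):
                queue.append(result)   # schedule the follow-up request
            else:
                items.append(result)   # a finished scraped item
    return items

print(run("https://example.com"))
# → [{'title': 'title of https://example.com/item/1'},
#    {'title': 'title of https://example.com/item/2'}]
```

The key point carried over to the real spider: a callback never calls another callback directly; it yields a Request naming the callback, and the engine invokes that function with the downloaded response.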