Python scraping: follow pagination links to scrape data

I am trying to scrape data from a page and then keep following the pagination links to scrape the rest.

The page I am trying to scrape is -->

Problem
  • The code fails to follow the pagination link
Can you help?
  • Modify the code so that it follows the pagination link

    • It does not work because the URL is invalid. If you want to keep using
      scrapy.Request, you can do:

      next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
      if next_page_url:
          # The href may be relative; urljoin resolves it against the current page URL
          next_page_url = response.urljoin(next_page_url)
          yield scrapy.Request(url=next_page_url, callback=self.parse)
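
      For context: scrapy.Request only accepts absolute URLs, so a relative
      href (for example '?page=2') fails with a ValueError along the lines of
      "Missing scheme in request url". response.urljoin() resolves the href
      against the URL of the page just fetched, essentially like
      urllib.parse.urljoin. A minimal sketch, where the relative href
      '?page=2' is a hypothetical stand-in for whatever the site returns:

      from urllib.parse import urljoin

      # Hypothetical example: the base is the page currently being parsed,
      # and '?page=2' stands in for the relative href extracted from it.
      base = 'https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1'
      print(urljoin(base, '?page=2'))
      # https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2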
      
      A shorter solution:

      next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
      if next_page_url:
          yield response.follow(next_page_url)
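
      A note on the design choice: response.follow() (available since Scrapy
      1.4) does the urljoin step for you, and when no callback is given the
      response is routed to the spider's default parse() method. It can also
      take a Selector directly, so a variant of the above, assuming the page
      really exposes a <link rel="next"> element, would be:

      # Sketch: pass the <link rel="next"> selector itself;
      # response.follow() extracts and joins its href attribute.
      next_link = response.xpath('//link[@rel="next"]')
      if next_link:
          yield response.follow(next_link[0], callback=self.parse)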
      

      To make the code work, you need to fix the broken link using
      response.follow() or a similar approach. Try the spider below:

      import scrapy

      class AlibabaSpider(scrapy.Spider):
          name = 'alibaba'
          allowed_domains = ['alibaba.com']
          start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

          def parse(self, response):
              # One iteration per product card on the listing page
              for product in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
                  item = {
                      'product_name': product.xpath('.//h2/a/@title').extract_first(),
                      'price': product.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                      'min_order': product.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                      'company_name': product.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                      'prod_detail_link': product.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                      'response_rate': product.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                      #'image_url': product.xpath('.//div[@class=""]/').extract_first(),
                  }
                  yield item

              # Follow the pagination link
              next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
              if next_page_url:
                  yield response.follow(url=next_page_url, callback=self.parse)

      The code you pasted had badly broken indentation; I fixed that as well.
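
      If you want to try it quickly, here is a minimal sketch for running the
      spider standalone, assuming the class above is saved in the same file;
      the output filename products.json is arbitrary:

      from scrapy.crawler import CrawlerProcess

      process = CrawlerProcess(settings={
          # The FEEDS setting needs Scrapy >= 2.1; older versions use
          # FEED_URI / FEED_FORMAT instead
          'FEEDS': {'products.json': {'format': 'json'}},
      })
      process.crawl(AlibabaSpider)
      process.start()  # blocks until the crawl finishes

      Alternatively, running "scrapy runspider yourfile.py -o products.json"
      from the command line (with a filename of your choice) does the same.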
