
Python: Getting the type category of URLs using Scrapy


For this, I need all the product URLs and their respective types.

So the output should be:

Product_URL1 Blouse
Product_URL2 Crop Top
Product_URL3 Tank Top
Product_URL4 Strappy Top
Product_URL5 Tube Top
Below is my code; I think everything is correct except the xpath for item['type'].

from scrapy.spiders import CrawlSpider
import scrapy
from scrapy.http.request import Request


class JabongItem(scrapy.Item):
    base_link = scrapy.Field()
    type = scrapy.Field()
    count = scrapy.Field()
    product_name = scrapy.Field()
    product_link = scrapy.Field()


class JabongScrape(CrawlSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]  # must be a list, not a bare string
    start_urls = ["http://www.jabong.com/women/clothing/tops-tees-shirts/tops",
                  "http://www.jabong.com/women/clothing/tops-tees-shirts/tees"]

    def parse(self, response):
        item = JabongItem()
        try:
            for idx in range(0, 20):
                # type name, link and item count come from the filter panel on the category page
                item['type'] = response.xpath("//div[contains(@class, 'options')]/label/a/text()").extract()[idx]
                item['base_link'] = response.url + response.xpath("//div[contains(@class, 'options')]/label/a/@href").extract()[idx] + "?ax=1&page=1&limit=" + (response.xpath("//div[contains(@class, 'options')]/label/small/text()").extract()[idx]).replace("[", "").replace("]", "") + "&sortField=popularity&sortBy=desc"
                item['count'] = (response.xpath("//div[contains(@class, 'options')]/label/small/text()").extract()[idx]).replace("[", "").replace("]", "")
                yield Request(item['base_link'], callback=self.parse_product_link,
                              meta={'item': item, 'count': int(item['count'])}, dont_filter=True)
        except IndexError:  # fewer than 20 filter entries on the page
            pass

    def parse_product_link(self, response):
        item = response.meta['item']
        try:
            for i in range(0, response.meta['count']):
                item['product_link'] = response.xpath("//div[contains(@class, 'col-xxs-6 col-xs-4 col-sm-4 col-md-3 col-lg-3 product-tile img-responsive')]/a/@href").extract()[i]
                # item['original_price']=response.xpath("section.row > div:nth-child(1) > a:nth-child(1) > div:nth-child(2) > div:nth-child(2) > span:nth-child(1) > span:nth-child(1)::text").extract()[idx]
                print(i)
                yield item
        except IndexError:  # fewer products on the page than the advertised count
            pass
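
Since the only doubtful piece is the xpath for item['type'], one quick way to test candidate xpaths against the live page is Scrapy's interactive shell (the command below assumes the first start URL):

    scrapy shell "http://www.jabong.com/women/clothing/tops-tees-shirts/tops"
    >>> response.xpath("//div[contains(@class, 'options')]/label/a/text()").extract()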

As Rafael pointed out, jbng_base_links.txt contains "\n", so the easiest way is to manually restructure the spider to follow this order:

  • Go to the webpage
  • Find the type URLs
  • Go to every type URL -> scrape the items

It can be as simple as:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = []

        def parse(self, response):
            """this will parse landing page for type urls"""
            urls = response.xpath("//div[contains(text(),'Type')]/..//a/@href").extract()
            for url in urls:
                url = response.urljoin(url)
                yield scrapy.Request(url, self.parse_type)

        def parse_type(self, response):
            """this will parse every type page for items"""
            type_name = response.xpath("//a[@class='filtered-brand']/text()").extract_first()
            product_urls = ...  # e.g. the product-link xpath from the question's spider
            for url in product_urls:
                yield {'type': type_name, 'url': url}
            # handle next page

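
The "# handle next page" step might look like the sketch below; the next-page selector is an assumption, since it depends on Jabong's actual markup:

    # a minimal pagination sketch (assumed rel="next" link), appended to parse_type:
    next_url = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_url:
        yield scrapy.Request(response.urljoin(next_url), self.parse_type)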

My suggestion is to scrape with Spider rather than CrawlSpider, and to scrape each type separately, since every type has its own link, e.g. shirts; a minimal sketch follows.
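
A minimal sketch of that per-type approach, assuming a hypothetical type_urls mapping (the spider name and the filter URLs are illustrative, not taken from the original answer):

    import scrapy

    class TypeSpider(scrapy.Spider):
        name = "jabong_types"
        # hypothetical type -> listing-URL mapping; the real links come from
        # the filter panel on each category page
        type_urls = {
            "Blouse": "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?type=blouse",
            "Tank Top": "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?type=tank-top",
        }

        def start_requests(self):
            for type_name, url in self.type_urls.items():
                yield scrapy.Request(url, self.parse_type, meta={"type": type_name})

        def parse_type(self, response):
            # product-link xpath reused from the question's spider
            for href in response.xpath("//div[contains(@class, 'product-tile')]/a/@href").extract():
                yield {"type": response.meta["type"], "url": response.urljoin(href)}

Each type page then yields {'type': ..., 'url': ...} pairs matching the desired output.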