Python Scrapy can't download images locally


I'm using Scrapy to crawl a website. I need to do three things:

  • I need the categories and subcategories of the images
  • I need to download the images and store them locally
  • I need to store the categories, subcategories and image URLs in Mongo

But right now I'm stuck: I'm using a pipeline to download the images, but my code doesn't work and the images are never downloaded locally.

Also, since I want to store the information in Mongo, can anyone give me some advice on the "Mongo table structure"?

My code is as follows:

    
settings.py

    BOT_NAME = 'tutorial'
    
    SPIDER_MODULES = ['tutorial.spiders']
    NEWSPIDER_MODULE = 'tutorial.spiders'
    
    ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
    IMAGES_STORE = '/ttt'
    
    
items.py

    from scrapy.item import Item, Field

    class TutorialItem(Item):
        # define the fields for your item here like:
        # name = Field()
        catname = Field()
        caturl = Field()
        image_urls = Field()
        images = Field()
    
    
pipelines.py

    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    from pprint import pprint as pp
    
    class TutorialPipeline(object):
        # def get_media_requests(self, item, info):
        #     for image_url in item['image_urls']:
        #         yield Request(image_url)
    
        # def process_item(self, item, spider):
            # print '**********************===================*******************'
            # return item
            # pp(item)
            # pass
    
        def get_media_requests(self,item,info):
            # pass
            pp('**********************===================*******************')
    
            # yield Request(item['image_urls'])
            for image_url in item['image_urls']:
                # pass
                # print image_url
                yield Request(image_url)
    
spider.py

    import scrapy
    import os
    from pprint import pprint as pp
    from scrapy import log
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spider import Spider
    
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    
    from tutorial.items import TutorialItem
    from pprint import pprint as pp
    
    class BaiduSpider(scrapy.spider.Spider):
        name='baidu'
        start_urls=[
            # 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
            'http://giphy.com/categories'
        ]
    
        domain='http://giphy.com'
    
        def parse(self,response):
            selector=Selector(response)
    
            topCategorys=selector.xpath('//div[@id="None-list"]/a')
    
            # pp(topCategorys)
            items=[]
            for tc in topCategorys:
                item=TutorialItem()
                item['catname']=tc.xpath('./text()').extract()[0]
                item['caturl']=tc.xpath('./@href').extract()[0]
                if item['catname']==u'ALL':
                    continue
                reqUrl=self.domain+'/'+item['caturl']
                # pp(reqUrl)
                yield Request(url=reqUrl,meta={'caturl':reqUrl},callback=self.getSecondCategory)
        def getSecondCategory(self,response):
            selector=Selector(response)
            # pp(response.meta['caturl'])
            # pp('*****************=================**************')
    
            secondCategorys=selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a')
    
            # pp(secondCategorys)
            items=[]
            for sc in secondCategorys:
                item=TutorialItem()
                item['catname']=sc.xpath('./div/h4/text()').extract()[0]
                item['caturl']=sc.xpath('./@href').extract()[0]
                items.append(item)
    
                reqUrl=self.domain+item['caturl']
            # pp(items)
                # pp(item)
                # pp(reqUrl)
                yield Request(url=reqUrl,meta={'caturl':reqUrl},callback=self.getImages)
    
        def getImages(self,response):
            selector=Selector(response)
            # pp(response.meta['caturl'])
            # pp('*****************=================**************')
    
    
            # images=selector.xpath('//ul[@class="gifs  freeform grid_12"]/div[position()=3]')
            images=selector.xpath('//*[contains (@class,"hoverable-gif")]')
            # images=selector.xpath('//ul[@class="gifs  freeform grid_12"]//div[@class="hoverable-gif"]')
            # pp(len(images))
            items=[]
            for image in images:
                item=TutorialItem()
                item['image_urls']=image.xpath('./a/figure/img/@src').extract()[0]
                # item['imgName']=image.xpath('./a/figure/img/@alt').extract()[0]
                items.append(item)
                # pp(item)
                # pp(items)
                # pp('==============************==============')
    
            # pp(items)
            # items=[{'images':"hello world"}]
            return items
    
Also, there are no errors in the output, as shown below:

    2014-12-21 13:49:56+0800 [scrapy] INFO: Enabled item pipelines: TutorialPipeline
    2014-12-21 13:49:56+0800 [baidu] INFO: Spider opened
    2014-12-21 13:49:56+0800 [baidu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-12-21 13:49:56+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-12-21 13:49:56+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-12-21 13:50:07+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/categories> (referer: None)
    2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/science/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/sports/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/news-politics/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/transportation/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/interests/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/memes/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/tv/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/gaming/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/nature/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/emotions/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/movies/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/holiday/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/reactions/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/music/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/decades/> (referer: http://giphy.com/categories)
    2014-12-21 13:50:12+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/search/the-colbert-report/> (referer: http://giphy.com//categories/news-politics/)
    2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
        {'image_urls': u'http://media1.giphy.com/media/2BDLDXFaEiuBy/200_s.gif'}
    2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
        {'image_urls': u'http://media2.giphy.com/media/WisjAI5QGgsrC/200_s.gif'}
    2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
        {'image_urls': u'http://media3.giphy.com/media/ZgDGEMihlZXCo/200_s.gif'}
    .............
    
    
In my opinion, there is no need for you to override ImagesPipeline, since you are not modifying its behavior. But since you are doing it, it should be done properly. When overriding ImagesPipeline, two methods should be overridden:

  • get_media_requests(item, info) should return a Request for every URL in image_urls. This part you are doing correctly.

  • item_completed(results, item, info) is called when all image requests for a single item have completed (finished downloading, or failed for some reason). From the docs:

    The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.

So, for your custom images pipeline to work, you need to override the item_completed() method, like this:

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
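
Taken together, and assuming every item that reaches this pipeline carries image_urls (as in the spider sketch further below), a minimal sketch of the full pipeline could look like this. One detail worth stressing, since it is the likely reason nothing is downloaded yet no error is logged: the class must inherit from ImagesPipeline rather than object, otherwise Scrapy never calls get_media_requests() at all. The image_paths field comes from the docs example and would need to be added to TutorialItem:

    from scrapy.contrib.pipeline.images import ImagesPipeline  # scrapy.pipelines.images in newer Scrapy
    from scrapy.exceptions import DropItem
    from scrapy.http import Request

    class TutorialPipeline(ImagesPipeline):  # inherit from ImagesPipeline, not object

        def get_media_requests(self, item, info):
            # item['image_urls'] must be a list; a bare string would be
            # iterated character by character here
            for image_url in item['image_urls']:
                yield Request(image_url)

        def item_completed(self, results, item, info):
            # results holds one (success, info_or_failure) tuple per
            # request yielded by get_media_requests()
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item

It is also worth making sure the crawl process can create and write to IMAGES_STORE; a directory such as /ttt at the filesystem root usually requires elevated permissions.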
    
Additionally, regarding other issues in your code that keep it from working as expected:

  • You are not actually creating any useful items. If you look at your parse() and getSecondCategory() functions, you will notice that no items are returned or yielded. Although you prepare an items list, which you apparently want to use to store your items, it is never used to pass the items further along the processing path. At one point you only yield a Request for the next page, and when the function completes, your items are discarded.

  • You are not using the caturl information that you pass via the meta dictionary. You pass it in parse() and getSecondCategory(), but you never collect it in the callback functions, so it is ignored as well.

So, if you fix the pipeline as suggested above, the only part that will basically work is the image pipeline. To address the other issues, follow the sketch below (keep in mind it is not tested; it is only a guideline for your consideration):
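
One possible wiring of the callbacks, reconstructed here as an untested illustration reusing the names and selectors from the spider above (meta keys such as subname are made up for this sketch): each callback reads what the previous one put in meta, and one item is yielded per image, so that every item reaching the images pipeline actually carries image_urls:

    def parse(self, response):
        selector = Selector(response)
        for tc in selector.xpath('//div[@id="None-list"]/a'):
            catname = tc.xpath('./text()').extract()[0]
            caturl = tc.xpath('./@href').extract()[0]
            if catname == u'ALL':
                continue
            yield Request(url=self.domain + '/' + caturl,
                          meta={'catname': catname},
                          callback=self.getSecondCategory)

    def getSecondCategory(self, response):
        selector = Selector(response)
        for sc in selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a'):
            subname = sc.xpath('./div/h4/text()').extract()[0]
            suburl = sc.xpath('./@href').extract()[0]
            yield Request(url=self.domain + suburl,
                          # forward both levels of category info to the next callback
                          meta={'catname': response.meta['catname'],
                                'subname': subname, 'caturl': suburl},
                          callback=self.getImages)

    def getImages(self, response):
        selector = Selector(response)
        for image in selector.xpath('//*[contains(@class, "hoverable-gif")]'):
            item = TutorialItem()
            item['catname'] = response.meta['subname']  # collected from meta this time
            item['caturl'] = response.meta['caturl']
            # image_urls must stay a list for get_media_requests() to iterate
            item['image_urls'] = [image.xpath('./a/figure/img/@src').extract()[0]]
            yield item  # items are finally yielded, so the pipeline sees them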


Please provide the errors/results you are getting. That would help answerers triage your code better. @kartikg3 There are no errors; everything looks normal.