Python: how to give images a custom name when downloading them with Scrapy
This is my program that downloads images through the images pipeline. It downloads the images fine, but the problem is that it renames them to their SHA1 hash, after which I can't identify them. Is there any solution that lets me use the model_name as the name of the downloaded image?
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from selenium import webdriver
    from urlparse import urljoin
    import time


    class CompItem(scrapy.Item):
        model_name = scrapy.Field()
        images = scrapy.Field()
        image_urls = scrapy.Field()
        image_name = scrapy.Field()


    class criticspider(CrawlSpider):
        name = "buysmaart_images"
        allowed_domains = ["buysmaart.com"]
        start_urls = ["http://buysmaart.com/productdetails/550/Samsung-Galaxy-Note-4",
                      "http://buysmaart.com/productdetails/115/HTC-One-M8-Eye",
                      "http://buysmaart.com/productdetails/506/OPPO-N1",
                      "http://buysmaart.com/productdetails/342/LG-G2-D802T"]

        def __init__(self, *args, **kwargs):
            super(criticspider, self).__init__(*args, **kwargs)
            self.download_delay = 0.25
            self.browser = webdriver.Firefox()
            self.browser.implicitly_wait(2)

        def parse_start_url(self, response):
            self.browser.get(response.url)
            time.sleep(8)
            sel = Selector(text=self.browser.page_source)
            item = CompItem()

            photos = sel.xpath('//ul[contains(@id,"productImageUl")]/li')
            print len(photos)
            all_photo_urls = []
            for photo in photos:
                item['image_name'] = sel.xpath('.//h3[contains(@class,"ng-binding")]/text()').extract()[0].encode('ascii', 'ignore')
                #tmp_url = photo.xpath('.//img/@src').extract()[0].encode('ascii','ignore')
                image_url = photo.xpath('.//img/@src').extract()[0]
                all_photo_urls.append(image_url)
            item['image_urls'] = all_photo_urls
            yield item
Pipeline
    from scrapy.contrib.pipeline.images import DownloadImagesPipeline
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    import re


    class DownloadImagesPipeline(object):
        def get_media_requests(self, item, info):
            return [Request(x, meta={'image_name': item["image_name"]})
                    for x in item.get('image_urls', [])]

        def get_images(self, response, request, info):
            for key, image, buf in super(DownloadImagesPipeline, self).get_images(response, request, info):
                if re.compile('^[0-9,a-f]+.jpg$').match(key):
                    key = self.change_filename(key, response)
                yield key, image, buf

        def change_filename(self, key, response):
            return "%s.jpg" % response.meta['image_name'][0]

        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
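The if-check in get_images is meant to rename only files that still carry Scrapy's default name, which is a SHA1 hex digest (characters 0-9 and a-f only). A quick standalone check of that pattern (note the character class also admits commas, an apparent quirk of the original):

```python
import re

# Pattern copied from the pipeline above: matches default SHA1-style
# names like '971c41...bf2.jpg' but not an already-customized name.
sha1_name = re.compile('^[0-9,a-f]+.jpg$')

print(bool(sha1_name.match('971c419dd609331343dee105fffd0f4608dc0bf2.jpg')))  # True
print(bool(sha1_name.match('Samsung-Galaxy-Note-4.jpg')))                     # False
```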
Settings
BOT_NAME = 'download_images'
SPIDER_MODULES = ['download_images.spiders']
NEWSPIDER_MODULE = 'download_images.spiders'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE= '/home/john/Desktop/download_images/31_jul'
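Note that these settings register only the stock ImagesPipeline, so the custom DownloadImagesPipeline above never runs. Assuming the class lives in a pipelines.py module of the download_images project (a guess at the layout), the registration would instead look like:

```python
# settings.py: point ITEM_PIPELINES at your own class, not the stock one
ITEM_PIPELINES = ['download_images.pipelines.DownloadImagesPipeline']
IMAGES_STORE = '/home/john/Desktop/download_images/31_jul'
```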
The solution is to override the image_key method of the DownloadImagesPipeline class:

    def image_key(self, url):
        return 'image_name.here'

For example, if you want the image name derived from its URL, you can use

    url.split('/')[-1]

as the name of the image.

Note: this method is deprecated and may be removed in a future release.
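Illustrating that URL-based naming rule outside of Scrapy (name_from_url is a hypothetical helper, not part of the Scrapy API):

```python
def name_from_url(url):
    # Take everything after the last slash as the file name.
    return url.split('/')[-1]

print(name_from_url('http://example.com/images/front.jpg'))  # front.jpg
```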
Alternatively, you can also set the image_name for your image in the spider:

    item['image_name'] = ['whatever_you_want']

In that case you have to extend the pipeline further to make use of the image name you supply:

    def get_media_requests(self, item, info):
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    def get_images(self, response, request, info):
        for key, image, buf in super(DownloadImagesPipeline, self).get_images(response, request, info):
            if re.compile('^[0-9,a-f]+.jpg$').match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        return "%s.jpg" % response.meta['image_name'][0]

Of course, your pipeline should extend ImagesPipeline.
Scrapy 1.3.3 solution (override the image_downloaded method):
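The code of that 1.3.3 answer did not survive extraction. As a hedged sketch (not the original author's code): in Scrapy 1.3.x a subclass of ImagesPipeline can override image_downloaded, or more simply the file_path hook it relies on, and compute the name from request.meta. The helper below isolates that naming logic so it runs without Scrapy installed; name_for_request is a hypothetical name:

```python
import posixpath

def name_for_request(meta, url):
    """Choose a file name for a downloaded image: prefer the item's
    image_name passed through request.meta, else the URL's last segment."""
    name = meta.get('image_name')
    if isinstance(name, list):  # the spider may store a one-element list
        name = name[0] if name else None
    if name:
        return '%s.jpg' % name
    return posixpath.basename(url)

# Inside the pipeline it would plug in roughly like this (requires Scrapy):
# class DownloadImagesPipeline(ImagesPipeline):
#     def file_path(self, request, response=None, info=None):
#         return 'full/%s' % name_for_request(request.meta, request.url)
```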
I have updated my spider and pipeline, and it still doesn't give me the names.
Did you also include your images pipeline in your settings? ITEM_PIPELINES = ['download_images.pipelines.DownloadImagesPipeline'], or whatever the path to your DownloadImagesPipeline class is, wherever it lives.
NameError: Module 'scrapy.contrib.pipeline.images' doesn't define any object named 'DownloadImagesPipeline'.
How did you update your settings? You know you have to put the path to your own class there; it is not in scrapy.contrib but in your project.