Python: how to give images a custom name when downloading them with Scrapy
This is my program that downloads images through the images pipeline. It downloads the images fine, but the problem is that it renames them to their SHA1 hash, after which I can't identify them. Is there any solution that lets me use the model_name as the name of the downloaded image?
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from selenium import webdriver
    from urlparse import urljoin
    import time


    class CompItem(scrapy.Item):
        model_name = scrapy.Field()
        images = scrapy.Field()
        image_urls = scrapy.Field()
        image_name = scrapy.Field()


    class criticspider(CrawlSpider):
        name = "buysmaart_images"
        allowed_domains = ["buysmaart.com"]
        start_urls = ["http://buysmaart.com/productdetails/550/Samsung-Galaxy-Note-4",
                      "http://buysmaart.com/productdetails/115/HTC-One-M8-Eye",
                      "http://buysmaart.com/productdetails/506/OPPO-N1",
                      "http://buysmaart.com/productdetails/342/LG-G2-D802T"]

        def __init__(self, *args, **kwargs):
            super(criticspider, self).__init__(*args, **kwargs)
            self.download_delay = 0.25
            self.browser = webdriver.Firefox()
            self.browser.implicitly_wait(2)

        def parse_start_url(self, response):
            self.browser.get(response.url)
            time.sleep(8)
            sel = Selector(text=self.browser.page_source)
            item = CompItem()

            photos = sel.xpath('//ul[contains(@id,"productImageUl")]/li')
            print len(photos)
            all_photo_urls = []
            for photo in photos:
                item['image_name'] = sel.xpath('.//h3[contains(@class,"ng-binding")]/text()').extract()[0].encode('ascii', 'ignore')
                #tmp_url = photo.xpath('.//img/@src').extract()[0].encode('ascii','ignore')
                image_url = photo.xpath('.//img/@src').extract()[0]
                all_photo_urls.append(image_url)
            item['image_urls'] = all_photo_urls
            yield item
Pipeline
    from scrapy.contrib.pipeline.images import DownloadImagesPipeline
    from scrapy.exceptions import DropItem
    from scrapy.http import Request
    import re


    class DownloadImagesPipeline(object):
        def get_media_requests(self, item, info):
            return [Request(x, meta={'image_name': item["image_name"]})
                    for x in item.get('image_urls', [])]

        def get_images(self, response, request, info):
            for key, image, buf in super(DownloadImagesPipeline, self).get_images(response, request, info):
                if re.compile('^[0-9,a-f]+.jpg$').match(key):
                    key = self.change_filename(key, response)
                yield key, image, buf

        def change_filename(self, key, response):
            return "%s.jpg" % response.meta['image_name'][0]

        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
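The if-check in get_images is meant to rename only files that still carry Scrapy's default name, which is a SHA1 hex digest (characters 0-9 and a-f only). A quick standalone check of that pattern (note the character class also admits commas, an apparent quirk of the original):

```python
import re

# Pattern copied from the pipeline above: matches default SHA1-style
# names like '971c41...bf2.jpg' but not an already-customized name.
sha1_name = re.compile('^[0-9,a-f]+.jpg$')

print(bool(sha1_name.match('971c419dd609331343dee105fffd0f4608dc0bf2.jpg')))  # True
print(bool(sha1_name.match('Samsung-Galaxy-Note-4.jpg')))                     # False
```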
Settings
BOT_NAME = 'download_images'
SPIDER_MODULES = ['download_images.spiders']
NEWSPIDER_MODULE = 'download_images.spiders'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE= '/home/john/Desktop/download_images/31_jul'
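Note that these settings register only the stock ImagesPipeline, so the custom DownloadImagesPipeline above never runs. Assuming the class lives in a pipelines.py module of the download_images project (a guess at the layout), the registration would instead look like:

```python
# settings.py: point ITEM_PIPELINES at your own class, not the stock one
ITEM_PIPELINES = ['download_images.pipelines.DownloadImagesPipeline']
IMAGES_STORE = '/home/john/Desktop/download_images/31_jul'
```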
The solution is to override the image_key method of the DownloadImagesPipeline class:

    def image_key(self, url):
        return 'image_name.here'

For example, if you want the image name derived from its URL, you can use

    url.split('/')[-1]

as the name of the image.

Note: this method is deprecated and may be removed in a future release.
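Illustrating that URL-based naming rule outside of Scrapy (name_from_url is a hypothetical helper, not part of the Scrapy API):

```python
def name_from_url(url):
    # Take everything after the last slash as the file name.
    return url.split('/')[-1]

print(name_from_url('http://example.com/images/front.jpg'))  # front.jpg
```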
Alternatively, you can also set the image_name for your image in the spider:

    item['image_name'] = ['whatever_you_want']

In that case you have to extend the pipeline further to make use of the image name you supply:

    def get_media_requests(self, item, info):
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    def get_images(self, response, request, info):
        for key, image, buf in super(DownloadImagesPipeline, self).get_images(response, request, info):
            if re.compile('^[0-9,a-f]+.jpg$').match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        return "%s.jpg" % response.meta['image_name'][0]

Of course, your pipeline should extend ImagesPipeline.
Scrapy 1.3.3 solution (override the image_downloaded method):
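The code of that 1.3.3 answer did not survive extraction. As a hedged sketch (not the original author's code): in Scrapy 1.3.x a subclass of ImagesPipeline can override image_downloaded, or more simply the file_path hook it relies on, and compute the name from request.meta. The helper below isolates that naming logic so it runs without Scrapy installed; name_for_request is a hypothetical name:

```python
import posixpath

def name_for_request(meta, url):
    """Choose a file name for a downloaded image: prefer the item's
    image_name passed through request.meta, else the URL's last segment."""
    name = meta.get('image_name')
    if isinstance(name, list):  # the spider may store a one-element list
        name = name[0] if name else None
    if name:
        return '%s.jpg' % name
    return posixpath.basename(url)

# Inside the pipeline it would plug in roughly like this (requires Scrapy):
# class DownloadImagesPipeline(ImagesPipeline):
#     def file_path(self, request, response=None, info=None):
#         return 'full/%s' % name_for_request(request.meta, request.url)
```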
I have updated my spider and pipeline, and it still doesn't give me the names.
Did you also include your images pipeline in your settings? ITEM_PIPELINES = ['download_images.pipelines.DownloadImagesPipeline'], or whatever the path to your DownloadImagesPipeline class is, wherever it lives.
NameError: Module 'scrapy.contrib.pipeline.images' doesn't define any object named 'DownloadImagesPipeline'.
How did you update your settings? You know you have to put the path to your own class there; it is not in scrapy.contrib but in your project.