Scrapy doesn't download images
I am trying to download images from different URLs with scrapy. I am new to Python and scrapy, so I may be missing something obvious. This is my first post on Stack Overflow; any help is much appreciated. Here are my different files:

items.py
# -*- coding: utf-8 -*-
import scrapy


class PicscrapyItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
pipelines.py
import re
import hashlib

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class PicscrapyPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['image_urls']:
            if re.match(r'https', url):
                yield Request(url)

    def file_path(self, request, response=None, info=None):
        if not isinstance(request, Request):
            url = request
        else:
            url = request.url
        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        return '%s.jpg' % image_guid
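For reference, the file_path() method above names every stored image after the hex SHA1 digest of its URL, so the same URL always maps to the same file. A minimal standalone sketch of that naming scheme (using str.encode in place of scrapy's to_bytes helper):

```python
import hashlib

def sha1_filename(url):
    # Same idea as file_path(): hex SHA1 digest of the URL plus a .jpg suffix
    return '%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()

name = sha1_filename('https://example.com/dress.jpg')
print(name)  # 40 hex characters followed by '.jpg'
```

Because the name is deterministic, re-crawling the same image URL overwrites (or skips) the existing file rather than creating duplicates.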
settings.py
BOT_NAME = 'picScrapy'
SPIDER_MODULES = ['picScrapy.spiders']
NEWSPIDER_MODULE = 'picScrapy.spiders'
DEPTH_LIMIT = 3
IMAGES_STORE = 'F:/00'
IMAGES_MIN_WIDTH = 500
IMAGES_MIN_HEIGHT = 500
ROBOTSTXT_OBEY = False
LOG_FILE = "log"
pic.py (the spider)
from urlparse import urljoin  # Python 2; on Python 3 this is urllib.parse.urljoin

from scrapy.spiders import Spider
from scrapy.http import Request
from picScrapy.items import PicscrapyItem


class PicSpider(Spider):
    name = "pic"  # spider name
    start_url = 'https://s.taobao.com'  # crawl entry point
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36',
    }

    def start_requests(self):
        for i in range(1, 2):
            # url = 'http://www.win4000.com/wallpaper_2285_0_10_%d.html' % i
            url = 'https://s.taobao.com/list?spm=a217f.8051907.312344.1.353deac38xy87V&q=' \
                  '%E8%BF%9E%E8%A1%A3%E8%A3%99&style=' \
                  'grid&seller_type=taobao&cps=yes&cat=51108009&bcoffset=12&s=' + str(60 * i)
            yield Request(url, headers=self.headers)

    def parse(self, response):
        item = PicscrapyItem()
        item['image_urls'] = response.xpath('//img/@data-src').extract()
        yield item

        all_urls = response.xpath('//img/@src').extract()
        for url in all_urls:
            url = urljoin(self.start_url, url)
            yield Request(url, callback=self.parse)
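As an aside, the urljoin() call in parse() is what turns relative or protocol-relative src values into absolute URLs before they are re-queued. A quick illustration with hypothetical paths (not URLs taken from the crawl):

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'https://s.taobao.com'
# A protocol-relative URL picks up the base URL's scheme
print(urljoin(base, '//img.alicdn.com/x.jpg'))  # https://img.alicdn.com/x.jpg
# A root-relative path is resolved against the base host
print(urljoin(base, '/list/page2.html'))        # https://s.taobao.com/list/page2.html
```

Note that protocol-relative results like the first one start with "https://..." only after joining; the raw "//..." string would not match a pattern such as r'https' on its own.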
The log:
2017-07-11 14:28:25 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: picScrapy)
2017-07-11 14:28:25 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'picScrapy.spiders', 'SPIDER_MODULES': ['picScrapy.spiders'], 'LOG_FILE': 'log', 'DEPTH_LIMIT': 3, 'BOT_NAME': 'picScrapy'}
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled item pipelines:
['picScrapy.pipelines.PicscrapyPipeline']
2017-07-11 14:28:25 [scrapy.core.engine] INFO: Spider opened
2017-07-11 14:28:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 14:28:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-11 14:28:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-07-11 14:28:26 [scrapy.core.scraper] DEBUG: Scraped from
{'image_urls': [], 'images': []}
2017-07-11 14:28:26 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-11 14:28:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 426,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 37638,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 11, 6, 28, 26, 395000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 11, 6, 28, 25, 778000)}
2017-07-11 14:28:26 [scrapy.core.engine] INFO: Spider closed (finished)
You need to enable the pipeline in your settings.py file. If you want to use the built-in scrapy pipeline, add this to your settings:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
If you want to use your custom pipeline (the one in pipelines.py), add this instead:
ITEM_PIPELINES = {'[directory].pipelines.PicscrapyPipeline': 1}
where [directory] is the directory your pipelines.py file is in. Thanks, those were big problems I hadn't spotted, but some issues remain after I added them (t~t). Many thanks for the hint about the correct [directory]!
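Putting the answer together with the files shown: assuming the project package is picScrapy (as the picScrapy.spiders and picScrapy.items imports suggest), the relevant additions to settings.py would be a sketch like this:

```python
# settings.py (additions) -- register the custom images pipeline so items
# with an image_urls field actually trigger downloads
ITEM_PIPELINES = {'picScrapy.pipelines.PicscrapyPipeline': 1}

# Already present above, repeated here for context:
IMAGES_STORE = 'F:/00'   # directory where downloaded images are written
IMAGES_MIN_WIDTH = 500   # skip images narrower than 500 px
IMAGES_MIN_HEIGHT = 500  # skip images shorter than 500 px
```

Without an ITEM_PIPELINES entry, scrapy scrapes the items (hence 'item_scraped_count': 1 in the log) but never passes them to any images pipeline, which matches the symptom described.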