
Scrapy doesn't download images


I'm trying to download images from different URLs with Scrapy. I'm new to Python and Scrapy, so I may be missing something obvious. This is my first post on Stack Overflow; any help is much appreciated.

Here are my files:

items.py

# -*- coding: utf-8 -*-
import scrapy
class PicscrapyItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
pipelines.py

import hashlib
import re

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class PicscrapyPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['image_urls']:
            if re.match(r'https', url):
                yield Request(url)

    def file_path(self, request, response=None, info=None):
        if not isinstance(request, Request):
            url = request
        else:
            url = request.url
        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        return '%s.jpg' % image_guid
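As a quick sanity check of the `file_path` logic above: hashing the URL with SHA-1 yields a stable, filesystem-safe filename, so the same image URL always maps to the same file. A minimal sketch (the example URL is just an illustration, not from the project):

```python
import hashlib

def image_file_path(url):
    # Same scheme as file_path above: SHA-1 hex digest of the URL, plus '.jpg'
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s.jpg' % image_guid

name = image_file_path('https://example.com/pic.jpg')
print(name)  # 40 hex characters followed by '.jpg', identical on every run
```

Because the mapping is deterministic, re-crawling the same URL overwrites (or skips) the existing file instead of creating duplicates.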
settings.py

BOT_NAME = 'picScrapy'
SPIDER_MODULES = ['picScrapy.spiders']
NEWSPIDER_MODULE = 'picScrapy.spiders'
DEPTH_LIMIT = 3
IMAGES_STORE = 'F:/00'
IMAGES_MIN_WIDTH = 500
IMAGES_MIN_HEIGHT = 500
ROBOTSTXT_OBEY = False
LOG_FILE = "log"
pic.py (the spider)

from urlparse import urljoin
from scrapy.spiders import Spider
from scrapy.http import Request
from picScrapy.items import PicscrapyItem


class PicSpider(Spider):
    name = "pic"  # spider name
    start_url = 'https://s.taobao.com'  # crawl entry point
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36',
    }

    def start_requests(self):
        for i in range(1, 2):
            # url = 'http://www.win4000.com/wallpaper_2285_0_10_%d.html' % i
            url = 'https://s.taobao.com/list?spm=a217f.8051907.312344.1.353deac38xy87V&q=' \
                  '%E8%BF%9E%E8%A1%A3%E8%A3%99&style=' \
                  'grid&seller_type=taobao&cps=yes&cat=51108009&bcoffset=12&s='+str(60*i)
            yield Request(url, headers=self.headers)

    def parse(self, response):
        item = PicscrapyItem()
        item['image_urls'] = response.xpath('//img/@data-src').extract()
        yield item

        all_urls = response.xpath('//img/@src').extract()
        for url in all_urls:
            url = urljoin(self.start_url, url)
            yield Request(url, callback=self.parse)
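The `parse` method above resolves each relative `@src` against `start_url` before following it. The spider imports `urljoin` from `urlparse`, which is Python 2; under Python 3 the same function lives in `urllib.parse`. A small sketch of what that resolution does (the image paths are illustrative, not from the site):

```python
# Python 3 equivalent of the spider's `from urlparse import urljoin`
from urllib.parse import urljoin

base = 'https://s.taobao.com'

# A root-relative src is resolved against the base host
print(urljoin(base, '/img/banner.jpg'))        # https://s.taobao.com/img/banner.jpg

# A protocol-relative src (common in img @src) inherits the base scheme
print(urljoin(base, '//img.alicdn.com/x.jpg'))  # https://img.alicdn.com/x.jpg
```

This matters here because protocol-relative `//...` URLs only become `https://...` after joining, and the pipeline's `re.match(r'https', url)` filter silently drops anything that isn't already absolute `https`.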
log

2017-07-11 14:28:25 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: picScrapy)
2017-07-11 14:28:25 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'picScrapy.spiders', 'SPIDER_MODULES': ['picScrapy.spiders'], 'LOG_FILE': 'log', 'DEPTH_LIMIT': 3, 'BOT_NAME': 'picScrapy'}
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-11 14:28:25 [scrapy.middleware] INFO: Enabled item pipelines:
['picScrapy.pipelines.PicscrapyPipeline']
2017-07-11 14:28:25 [scrapy.core.engine] INFO: Spider opened
2017-07-11 14:28:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-11 14:28:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-11 14:28:26 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2017-07-11 14:28:26 [scrapy.core.scraper] DEBUG: Scraped from
{'image_urls': [], 'images': []}
2017-07-11 14:28:26 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-11 14:28:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 426,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 37638,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 11, 6, 28, 26, 395000),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 11, 6, 28, 25, 778000)}
2017-07-11 14:28:26 [scrapy.core.engine] INFO: Spider closed (finished)

You need to enable the pipeline in your settings.py file. If you want to use Scrapy's built-in images pipeline, add this to your settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

If you want to use your custom pipeline (the one in pipelines.py), add this instead:

ITEM_PIPELINES = {'[directory].pipelines.PicscrapyPipeline': 1}

where [directory] is the directory your pipelines.py
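In this project the package appears to be named picScrapy (judging from SPIDER_MODULES = ['picScrapy.spiders'] in settings.py), so the custom pipeline would presumably be enabled like this:

```python
# settings.py -- enable the custom pipeline from picScrapy/pipelines.py
ITEM_PIPELINES = {
    'picScrapy.pipelines.PicscrapyPipeline': 1,
}
```

The integer value is the pipeline's priority; lower values run earlier when several pipelines are enabled.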

file is in. Thanks, those were big problems I hadn't spotted, but there are still some issues after I added them (T~T). Thanks a lot for the hint about the correct [directory]!