Web scraping: Scrapy does not download images


I am trying to download images from different URLs with Scrapy. I am new to Python and Scrapy, so I am probably missing something obvious. This is my first post on Stack Overflow; any help is much appreciated.

Here are my different files:

items.py

from scrapy.item import Item, Field

class ImagesTestItem(Item):
    image_urls = Field()
    image_names = Field()
    images = Field()
    pass
settings.py:

from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)

BOT_NAME = 'images_test'

SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True
The spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item,Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9)  ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:    
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
            return item

        print item['image_urls']
pipelines.py:

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image 
from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
scrapy.log.ERROR

class images_test(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
The log looks like this:

/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    STATS_ENABLED: no longer supported (change STATS_CLASS instead)
  warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2376,
     'downloader/request_count': 9,
     'downloader/request_method_count/GET': 9,
     'downloader/response_bytes': 343660,
     'downloader/response_count': 9,
     'downloader/response_status_count/200': 8,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
     'log_count/DEBUG': 15,
     'log_count/ERROR': 1,
     'log_count/INFO': 3,
     'log_count/WARNING': 1,
     'response_received_count': 9,
     'scheduler/dequeued': 9,
     'scheduler/dequeued/memory': 9,
     'scheduler/enqueued': 9,
     'scheduler/enqueued/memory': 9,
     'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)
Why are the images not being saved? Even my print item['image_urls'] statement is never executed.


Thanks

Consider changing your spider code to the following:

from urlparse import urljoin  # Python 2

start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
    sel = HtmlXPathSelector(response)
    item = ImagesTestItem()
    url = 'http://veranstaltungszentrum.bbaw.de'
    item['image_urls'] = [urljoin(url, x)
                          for x in sel.select('//img/@src').extract()]
    return item

HtmlXPathSelector can only parse HTML documents, and it looks like you are feeding it images directly from your start_urls. Start from the gallery page instead and extract the image URLs from it.
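For reference, the urljoin call in the snippet above resolves each extracted src against an absolute base, which is what the images pipeline needs. A minimal sketch of that behavior (shown with Python 3's urllib.parse; on Python 2 the same function lives in urlparse):

```python
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://veranstaltungszentrum.bbaw.de'

# A root-relative src is resolved against the domain root:
src = '/en/photo_gallery/leib01_g.jpg'
print(urljoin(base, src))
# http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg
```

Note that if the page uses page-relative srcs (no leading slash), you would join against the full page URL rather than the domain root.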

You could also try doing it without a pipeline:

import urllib.request  # Python 3

def parse(self, response):
    # extract the image URL and derive a file name from it
    imageurl = response.xpath("//img/@src").get()
    imagename = imageurl.split("/")[-1]
    # some servers reject requests without a browser-like User-Agent
    req = urllib.request.Request(
        imageurl,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    with open("foldername/" + imagename, "wb") as output:
        output.write(resource.read())
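One caveat with deriving the file name via imageurl.split("/")[-1]: if the URL carries a query string, it ends up in the file name. A hedged alternative (the helper name here is hypothetical) that keeps only the path's last segment:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Take only the path component of the URL, then its last segment."""
    return os.path.basename(urlparse(url).path)

print(filename_from_url('http://example.com/gallery/leib01_g.jpg?size=large'))
# leib01_g.jpg
```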