Web scraping: Scrapy doesn't download images
I am trying to download images from different URLs with Scrapy. I am new to Python and Scrapy, so I am probably missing something obvious. This is my first post on Stack Overflow; any help is much appreciated. Here are my different files:

items.py:
from scrapy.item import Item, Field

class ImagesTestItem(Item):
    image_urls = Field()
    image_names = Field()
    images = Field()
settings.py:
from scrapy import log
log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
BOT_NAME = 'images_test'
SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True
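For reference, with `ITEM_PIPELINES` and `IMAGES_STORE` set as above, Scrapy's stock ImagesPipeline saves each downloaded file under a `full/` subdirectory of `IMAGES_STORE`, named by the SHA1 hash of the image URL. A small sketch of the expected on-disk path (the helper name is illustrative, not part of the Scrapy API):

```python
import hashlib

def expected_image_path(url):
    # Scrapy's ImagesPipeline stores each image under IMAGES_STORE as
    # full/<sha1-of-request-url>.jpg
    return 'full/%s.jpg' % hashlib.sha1(url.encode('utf-8')).hexdigest()

path = expected_image_path(
    'http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg')
```

So an empty `full/` directory after a run is a quick sign the pipeline never received any `image_urls`.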
The spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9)]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
        return item
        print item['image_urls']
pipelines.py:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image
from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)

class images_test(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
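As a side note, `item_completed` receives a list of `(success, info)` tuples from the downloader, and the list comprehension in it keeps only the paths of successful downloads. A minimal illustration with made-up result data:

```python
# Made-up example of the "results" list that ImagesPipeline passes to
# item_completed: each entry is (True, info_dict) on success or
# (False, failure) on error.
results = [
    (True,  {'url': 'http://example.com/a.jpg', 'path': 'full/aaa.jpg'}),
    (False, Exception('download failed')),
    (True,  {'url': 'http://example.com/b.jpg', 'path': 'full/bbb.jpg'}),
]

# Same comprehension as in item_completed: keep paths of successful downloads.
image_paths = [x['path'] for ok, x in results if ok]
# image_paths == ['full/aaa.jpg', 'full/bbb.jpg']
```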
The log looks like this:
/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
STATS_ENABLED: no longer supported (change STATS_CLASS instead)
warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2376,
'downloader/request_count': 9,
'downloader/request_method_count/GET': 9,
'downloader/response_bytes': 343660,
'downloader/response_count': 9,
'downloader/response_status_count/200': 8,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
'log_count/DEBUG': 15,
'log_count/ERROR': 1,
'log_count/INFO': 3,
'log_count/WARNING': 1,
'response_received_count': 9,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)
Why aren't the images being saved? Even my print item['image_urls'] statement never executes. Thank you.

Answer: Consider changing your spider code to the following:
from urlparse import urljoin  # needed for the join below

start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
    sel = HtmlXPathSelector(response)
    item = ImagesTestItem()
    url = 'http://veranstaltungszentrum.bbaw.de'
    item['image_urls'] = [urljoin(url, x) for x in
                          sel.select('//img/@src').extract()]
    return item
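The urljoin call is what turns the relative src attributes found on the gallery page into absolute URLs that the images pipeline can actually fetch. A quick stdlib-only illustration (shown with the Python 3 import path; on Python 2 it is `from urlparse import urljoin`):

```python
from urllib.parse import urljoin

# Resolve a relative img src against the site root, as the parse()
# method above does for every extracted //img/@src value.
base = 'http://veranstaltungszentrum.bbaw.de'
absolute = urljoin(base, 'en/photo_gallery/leib01_g.jpg')
# absolute == 'http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg'
```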
HtmlXPathSelector can only parse HTML documents, and it looks like you are feeding it images directly from your start_urls.

You can also try it without the pipeline:
import urllib.request  # Python 3

def parse(self, response):
    # extract your image URL
    imageurl = response.xpath("//img/@src").get()
    imagename = imageurl.split("/")[-1]
    req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    output = open("foldername/" + imagename, "wb")
    output.write(resource.read())
    output.close()
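The filename in the snippet above is simply everything after the last slash of the URL. For example:

```python
# Deriving a local filename from an image URL, as the snippet does.
imageurl = 'http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg'
imagename = imageurl.split("/")[-1]
# imagename == 'leib01_g.jpg'
```

Note this assumes the URL has no query string; for query URLs you would want to strip it first (e.g. with urllib.parse.urlparse).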