Python: download images to an absolute path
How can I create a pipeline that stores images in an absolute path that I create? I checked, but I could not find a way to change the storage location.

Note: I would prefer to use scrapy rather than requests to actually download the images.

    def parse_images(self, response):
        Name = response.meta['Name']
        album = response.meta['Album Name']
        os.makedirs(f'Master/{Name}/{album}', exist_ok=True)
        for ind, image in enumerate(response.xpath('//ul/li/a/img')):
            img = image.xpath('@srcset').extract_first().split(', ')[-1].split()[0]  # image URL
            print(img)
            imageName = f'image_{ind+1}' + os.path.splitext(img)[1]  # image_1.jpg
            path = os.path.join('Master', Name, album, imageName)
            abs_path = os.path.abspath(path)  # path where I want to download

This example downloads images from http://books.toscrape.com/ and uses a pipeline to put the first character of each file name into a subfolder.

In the settings I set the path to Master. It can be relative
or absolute:

    'IMAGES_STORE': 'Master',

This folder has to exist before you run the code. If it does not exist, the pipeline will not create it and nothing will be downloaded. The pipeline does create subfolders automatically, though, so you don't need makedirs().
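If you want to be certain where the files end up, one option (my own sketch, not part of the original answer) is to resolve the folder to an absolute path before starting the crawl; Master is the same folder name as in the setting above:

```python
import os

# resolve the relative store folder to an absolute path
images_store = os.path.abspath('Master')
print(images_store)

# IMAGES_STORE has to exist before downloading starts,
# so it is safest to create it here, outside the spider
os.makedirs(images_store, exist_ok=True)
```

You can then use images_store as the value of 'IMAGES_STORE' in the settings.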
In the parser I add name and album to the item so that these values are sent to the pipeline:

    def parse(self, response):
        print('url:', response.url)
In the pipeline's get_media_requests() I get the values from the item and put them in meta, which sends them to file_path(), the method that generates the file's local path (inside IMAGES_STORE). In the pipeline's file_path() I get the values from meta and finally create the path name/album/image.jpg. Originally the pipeline uses a hash code as the file name.
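That default hash naming can be reproduced like this (a sketch of the scheme only; scrapy's own code uses its to_bytes() helper where this uses encode(), and the URL below is made up):

```python
import hashlib

# default ImagesPipeline file name: SHA-1 of the request URL, stored under `full/`
url = 'http://books.toscrape.com/media/cache/pic.jpg'
image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
default_path = 'full/%s.jpg' % image_guid
print(default_path)
```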
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # send `meta` to `file_path()`
            yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})
    def file_path(self, request, response=None, info=None):
        # get `meta`
        name = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]
        #print('file_path:', request.url, request.meta, image)
        return '%s/%s/%s' % (name, album, image)

This saves images as IMAGES_STORE/name/album/image.jpg

Minimal working example

You can put all the code in one file and run it as a normal script (python script.py) without creating a scrapy project. That way everyone can easily test this code.
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.pipelines.images import ImagesPipeline
    #from scrapy.commands.view import open_in_browser
    #import json

    class MySpider(scrapy.Spider):
        name = 'myspider'
        #allowed_domains = []
        # see page created for scraping: http://toscrape.com/
        start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

        def parse(self, response):
            print('url:', response.url)
            #open_in_browser(response) # to see url in web browser
            # download images and convert to JPG (even if it is already JPG)
            for url in response.css('img::attr(src)').extract():
                url = response.urljoin(url)
                image = url.rsplit('/')[-1] # get first char from image name
                yield {'image_urls': [url], 'name': 'books', 'album': image[0]}

    # --- pipelines ---

    import os

    # --- original code --- # needed only if you use `image_guid`
    #import hashlib
    #from scrapy.utils.python import to_bytes
    # --- original code ---

    class RenameImagePipeline(ImagesPipeline):
        '''Pipeline to change file names - to add folder name'''

        def get_media_requests(self, item, info):
            # --- original code ---
            #for image_url in item['image_urls']:
            #    yield scrapy.Request(image_url)
            # --- original code ---
            for image_url in item['image_urls']:
                # send `meta` to `file_path()`
                yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})

        def file_path(self, request, response=None, info=None):
            # --- original code ---
            #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            #return 'full/%s.jpg' % (image_guid,)
            # --- original code ---
            # get `meta`
            name = request.meta['name']
            album = request.meta['album']
            image = request.url.rsplit('/')[-1]
            #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            print('file_path:', request.url, request.meta, image) #, image_guid)
            #return '%s/%s/%s.jpg' % (name, album, image_guid)
            return '%s/%s/%s' % (name, album, image)

    # --- run without project and save in `output.csv` ---

    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEED_FORMAT': 'csv',     # csv, json, xml
        'FEED_URI': 'output.csv', #
        # download images to `IMAGES_STORE/full` (standard folder) and convert to JPG (even if it is already JPG)
        # it needs `yield {'image_urls': [url]}` in `parse()` and both ITEM_PIPELINES and IMAGES_STORE to work
        #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1}, # use standard ImagesPipeline (download to IMAGES_STORE/full)
        'ITEM_PIPELINES': {'__main__.RenameImagePipeline': 1}, # use the pipeline created in this file (needs `__main__`)
        #'IMAGES_STORE': '/full/path/to/valid/dir', # this folder has to exist before downloading
        'IMAGES_STORE': 'Master', # this folder has to exist before downloading
    })
    c.crawl(MySpider)
    c.start()
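After running it, you can walk the store folder to see what the pipeline produced; a small sketch (the sample file is created here only to demonstrate the name/album/image layout, so the snippet runs without a crawl):

```python
import os

def list_downloaded(store):
    '''Return the paths of all files under the store folder, relative to it.'''
    found = []
    for root, _dirs, files in os.walk(store):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), store))
    return sorted(found)

# simulate the `name/album/image` layout the pipeline creates
os.makedirs('Master/books/a', exist_ok=True)
open('Master/books/a/sample.jpg', 'w').close()
print(list_downloaded('Master'))
```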
You can find the source code and see how it looks in the original ImagesPipeline. In the full example above I added some of the original code in comments.

To find where scrapy is installed:

    import scrapy
    print(scrapy.__file__)

On Linux I have:

    /usr/local/lib/python3.7/dist-packages/scrapy/pipelines/images.py
    /usr/local/lib/python3.7/dist-packages/scrapy/pipelines/files.py

BTW: ImagesPipeline converts every downloaded image to JPG, even if it is already a JPG. If you want to keep the original files, you may need FilesPipeline instead of ImagesPipeline, and FILES_STORE instead of IMAGES_STORE.
BTW: pipelines can sometimes be a problem to debug, because scrapy catches their exceptions and does not display an error message, so it is hard to recognize when code in a pipeline is wrong.
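One workaround (my addition, not from the original answer) is to catch and log exceptions inside the pipeline method yourself, so mistakes at least show up in the output:

```python
import logging

logger = logging.getLogger(__name__)

def file_path(self, request, response=None, info=None):
    try:
        name = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]
        return '%s/%s/%s' % (name, album, image)
    except Exception:
        # scrapy would normally swallow this - log it before re-raising
        logger.exception('file_path failed for %s', request.url)
        raise
```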
EDIT: The same example, but with FilesPipeline (and FILES_STORE and item['file_urls']). I marked the differences with the phrase "instead of".
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.pipelines.images import ImagesPipeline
    #from scrapy.commands.view import open_in_browser
    #import json

    class MySpider(scrapy.Spider):
        name = 'myspider'
        #allowed_domains = []
        # see page created for scraping: http://toscrape.com/
        start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

        def parse(self, response):
            print('url:', response.url)
            #open_in_browser(response) # to see url in web browser
            # download all types of files (without converting images to JPG)
            for url in response.css('img::attr(src)').extract():
                url = response.urljoin(url)
                image = url.rsplit('/')[-1] # get first char from image name
                #yield {'image_urls': [url], 'name': 'books', 'album': image[0]}
                yield {'file_urls': [url], 'name': 'books', 'album': image[0]}