Python: download images to an absolute path


How can I create a pipeline that stores the images in an absolute path that I create? I checked, but couldn't find a way to change the storage location.


Note: I would prefer to use scrapy rather than requests to actually download the images.


def parse_images(self,response):
    Name = response.meta['Name']
    album = response.meta['Album Name']
    os.makedirs(f'Master/{Name}/{album}',exist_ok=True)
    for ind,image in enumerate(response.xpath('//ul/li/a/img')):
        img = image.xpath('@srcset').extract_first().split(', ')[-1].split()[0] #image URL
        print(img)
        imageName = f'image_{ind+1}'+os.path.splitext(img)[1] #image_1.jpg
        path = os.path.join('Master',Name,album,imageName)
        abs_path = os.path.abspath(path) #Path where I want to download

This example downloads images from http://books.toscrape.com/ and uses a pipeline to put the first character of the file name into a subfolder.

In settings I set the path to Master. It can be relative,

 'IMAGES_STORE': 'Master',

or an absolute path.
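
For the absolute form you can use, for example, the commented-out variant from the full example below (the directory is only an illustration - any existing folder works):

 'IMAGES_STORE': '/full/path/to/valid/dir',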

This folder has to exist before you run the code. If it doesn't exist, the pipeline will not create it and nothing will be downloaded. But the pipeline creates the subfolders automatically, so you don't need makedirs().
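
If you don't want to create Master by hand, a minimal sketch is to create only that top-level folder in the script itself, before starting the crawl - the pipeline still creates the name/album subfolders on its own:

import os

os.makedirs('Master', exist_ok=True)  # only the top-level IMAGES_STORE folder has to exist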


In the parser I add name and album to the item, so these values are sent to the pipeline:

def parse(self, response):
    print('url:', response.url)

    #open_in_browser(response)  # to see url in web browser

    # download images and convert to JPG (even if it is already JPG)
    for url in response.css('img::attr(src)').extract():
        url = response.urljoin(url)
        image = url.rsplit('/')[-1] # get first char from image name
        yield {'image_urls': [url], 'name': 'books', 'album': image[0]}


In the pipeline, in get_media_requests(), I get the values from the item and put them into meta, to send them to file_path(), which generates the local path for the file (inside IMAGES_STORE):


def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        # send `meta` to `file_path()`
        yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})

In the pipeline, in file_path(), I get the values from meta and finally create the path name/album/image.jpg. Originally the pipeline uses a hashcode as the file name:

def file_path(self, request, response=None, info=None):
    # get `meta`
    name  = request.meta['name']
    album = request.meta['album']
    image = request.url.rsplit('/')[-1]
    #print('file_path:', request.url, request.meta, image)

    return '%s/%s/%s' % (name, album, image)
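
BTW: in newer scrapy (2.4 or later) file_path() also receives the item itself, so you don't have to pass the values through meta. A minimal sketch, assuming scrapy >= 2.4 (the default get_media_requests() can then stay unchanged); it returns the same name/album/image path:

def file_path(self, request, response=None, info=None, *, item=None):
    # `item` is passed in automatically by the media pipeline (scrapy >= 2.4)
    image = request.url.rsplit('/')[-1]
    return '%s/%s/%s' % (item['name'], item['album'], image)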

The image is then saved in IMAGES_STORE/name/album/image.jpg.

Minimal working example

You can put all of the code in one file and run it as a normal script - python script.py - without creating a scrapy project. This way everyone can easily test this code:

import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.images import ImagesPipeline
#from scrapy.commands.view import open_in_browser
#import json

class MySpider(scrapy.Spider):

    name = 'myspider'

    #allowed_domains = []

    # see page created for scraping: http://toscrape.com/
    start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

    def parse(self, response):
        print('url:', response.url)

        #open_in_browser(response)  # to see url in web browser

        # download images and convert to JPG (even if it is already JPG)
        for url in response.css('img::attr(src)').extract():
            url = response.urljoin(url)
            image = url.rsplit('/')[-1] # get first char from image name
            yield {'image_urls': [url], 'name': 'books', 'album': image[0]}


# --- pipelines ---

import os

# --- original code ---  # needed only if you use `image_guid`
#import hashlib        
#from scrapy.utils.python import to_bytes
# --- original code ---

class RenameImagePipeline(ImagesPipeline):
    '''Pipeline to change file names - to add folder name'''

    def get_media_requests(self, item, info):
        # --- original code ---
        #for image_url in item['image_urls']:
        #    yield scrapy.Request(image_url)
        # --- original code ---

        for image_url in item['image_urls']:
            # send `meta` to `file_path()`
            yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})

    def file_path(self, request, response=None, info=None):
        # --- original code ---
        #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        #return 'full/%s.jpg' % (image_guid,)
        # --- original code ---

        # get `meta`
        name  = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]
        #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        print('file_path:', request.url, request.meta, image) #, image_guid)

        #return '%s/%s/%s.jpg' % (name, album, image_guid)
        return '%s/%s/%s' % (name, album, image)

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv', #

    # download images to `IMAGES_STORE/full` (standard folder) and convert to JPG (even if it is already JPG)
    # it needs `yield {'image_urls': [url]}` in `parse()` and both ITEM_PIPELINES and IMAGES_STORE to work

    #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},  # use the standard ImagesPipeline (downloads to IMAGES_STORE/full)
    'ITEM_PIPELINES': {'__main__.RenameImagePipeline': 1},             # use the Pipeline created in the current file (needs `__main__.`)
    #'IMAGES_STORE': '/full/path/to/valid/dir',  # this folder has to exist before downloading
    'IMAGES_STORE': 'Master',  # this folder has to exist before downloading
})

c.crawl(MySpider)
c.start()

You can find the source code and see how it looks in the original ImagesPipeline. In the full example above I added some of the original code in comments.

BTW: using

import scrapy
print(scrapy.__file__)

you can find where scrapy is installed. On Linux I have

/usr/local/lib/python3.7/dist-packages/scrapy/pipelines/images.py
/usr/local/lib/python3.7/dist-packages/scrapy/pipelines/files.py
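
To print the exact paths on your own system, a small sketch using only the standard library:

import inspect
from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline

print(inspect.getsourcefile(ImagesPipeline))  # .../scrapy/pipelines/images.py
print(inspect.getsourcefile(FilesPipeline))   # .../scrapy/pipelines/files.py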


BTW: ImagesPipeline converts all images to JPG - even if it downloads a JPG. If you want to keep the original files, you may need FilesPipeline instead of ImagesPipeline, and FILES_STORE instead of IMAGES_STORE.


BTW: sometimes there are problems with a pipeline because it doesn't display error messages (scrapy catches the errors and doesn't show them), so it can be hard to recognize when the code in the pipeline is wrong.


EDIT: the same example but with FilesPipeline (and FILES_STORE, item['file_urls']).

I used the phrase "instead of" to show the differences.

import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.pipelines.images import ImagesPipeline
#from scrapy.commands.view import open_in_browser
#import json

class MySpider(scrapy.Spider):

    name = 'myspider'

    #allowed_domains = []

    # see page created for scraping: http://toscrape.com/
    start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

    def parse(self, response):
        print('url:', response.url)

        #open_in_browser(response)  # to see url in web browser

        # download all types of files (without converting images to JPG)
        for url in response.css('img::attr(src)').extract():
            url = response.urljoin(url)
            image = url.rsplit('/')[-1] # get first char from image name
            #yield {'image_urls': [url], 'name': 'books', 'album': image[0]}
            yield {'file_urls': [url], 'name': 'books', 'album': image[0]}
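
The pipeline and the settings change in the same way - a minimal sketch of the remaining "instead of" changes, following the ImagesPipeline version above (RenameFilePipeline is only a placeholder name):

class RenameFilePipeline(FilesPipeline):  # FilesPipeline instead of ImagesPipeline
    '''Pipeline to change file names - to add folder name'''

    def get_media_requests(self, item, info):
        for file_url in item['file_urls']:  # 'file_urls' instead of 'image_urls'
            # send `meta` to `file_path()`
            yield scrapy.Request(file_url, meta={'name': item['name'], 'album': item['album']})

    def file_path(self, request, response=None, info=None):
        # get `meta`
        name  = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]

        return '%s/%s/%s' % (name, album, image)


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',

    'ITEM_PIPELINES': {'__main__.RenameFilePipeline': 1},  # RenameFilePipeline instead of RenameImagePipeline
    'FILES_STORE': 'Master',  # FILES_STORE instead of IMAGES_STORE; this folder has to exist before downloading
})

c.crawl(MySpider)
c.start()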