Python: download images to an absolute path
How can I create a pipeline that stores images in an absolute path that I create? I checked, but I could not find a way to change the storage location.

Note: I would prefer to use scrapy rather than requests to actually download the images.

    def parse_images(self, response):
        Name = response.meta['Name']
        album = response.meta['Album Name']
        os.makedirs(f'Master/{Name}/{album}', exist_ok=True)
        for ind, image in enumerate(response.xpath('//ul/li/a/img')):
            img = image.xpath('@srcset').extract_first().split(', ')[-1].split()[0]  # image URL
            print(img)
            imageName = f'image_{ind+1}' + os.path.splitext(img)[1]  # image_1.jpg
            path = os.path.join('Master', Name, album, imageName)
            abs_path = os.path.abspath(path)  # path where I want to download

This example downloads images from http://books.toscrape.com/ and uses a pipeline to put the first character of each file name into a subfolder.

In the settings I set the path to Master. It can be relative
or absolute:

    'IMAGES_STORE': 'Master',

This folder has to exist before you run the code. If it does not exist, the pipeline will not create it and nothing will be downloaded. The pipeline does create subfolders automatically, though, so you don't need makedirs().
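If you want to be certain where the files end up, one option (my own sketch, not part of the original answer) is to resolve the folder to an absolute path before starting the crawl; Master is the same folder name as in the setting above:

```python
import os

# resolve the relative store folder to an absolute path
images_store = os.path.abspath('Master')
print(images_store)

# IMAGES_STORE has to exist before downloading starts,
# so it is safest to create it here, outside the spider
os.makedirs(images_store, exist_ok=True)
```

You can then use images_store as the value of 'IMAGES_STORE' in the settings.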
In the parser I add name and album to the item so that these values are sent to the pipeline:

    def parse(self, response):
        print('url:', response.url)
In the pipeline's get_media_requests() I get the values from the item and put them in meta, which sends them to file_path(), the method that generates the file's local path (inside IMAGES_STORE). In the pipeline's file_path() I get the values from meta and finally create the path name/album/image.jpg. Originally the pipeline uses a hash code as the file name.
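That default hash naming can be reproduced like this (a sketch of the scheme only; scrapy's own code uses its to_bytes() helper where this uses encode(), and the URL below is made up):

```python
import hashlib

# default ImagesPipeline file name: SHA-1 of the request URL, stored under `full/`
url = 'http://books.toscrape.com/media/cache/pic.jpg'
image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
default_path = 'full/%s.jpg' % image_guid
print(default_path)
```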
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # send `meta` to `file_path()`
            yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})
    def file_path(self, request, response=None, info=None):
        # get `meta`
        name = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]
        #print('file_path:', request.url, request.meta, image)
        return '%s/%s/%s' % (name, album, image)

This saves images as IMAGES_STORE/name/album/image.jpg

Minimal working example

You can put all the code in one file and run it as a normal script (python script.py) without creating a scrapy project. That way everyone can easily test this code.
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.pipelines.images import ImagesPipeline
    #from scrapy.commands.view import open_in_browser
    #import json

    class MySpider(scrapy.Spider):
        name = 'myspider'
        #allowed_domains = []
        # see page created for scraping: http://toscrape.com/
        start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

        def parse(self, response):
            print('url:', response.url)
            #open_in_browser(response) # to see url in web browser
            # download images and convert to JPG (even if it is already JPG)
            for url in response.css('img::attr(src)').extract():
                url = response.urljoin(url)
                image = url.rsplit('/')[-1] # get first char from image name
                yield {'image_urls': [url], 'name': 'books', 'album': image[0]}

    # --- pipelines ---

    import os

    # --- original code --- # needed only if you use `image_guid`
    #import hashlib
    #from scrapy.utils.python import to_bytes
    # --- original code ---

    class RenameImagePipeline(ImagesPipeline):
        '''Pipeline to change file names - to add folder name'''

        def get_media_requests(self, item, info):
            # --- original code ---
            #for image_url in item['image_urls']:
            #    yield scrapy.Request(image_url)
            # --- original code ---
            for image_url in item['image_urls']:
                # send `meta` to `file_path()`
                yield scrapy.Request(image_url, meta={'name': item['name'], 'album': item['album']})

        def file_path(self, request, response=None, info=None):
            # --- original code ---
            #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            #return 'full/%s.jpg' % (image_guid,)
            # --- original code ---
            # get `meta`
            name = request.meta['name']
            album = request.meta['album']
            image = request.url.rsplit('/')[-1]
            #image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            print('file_path:', request.url, request.meta, image) #, image_guid)
            #return '%s/%s/%s.jpg' % (name, album, image_guid)
            return '%s/%s/%s' % (name, album, image)

    # --- run without project and save in `output.csv` ---

    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEED_FORMAT': 'csv',     # csv, json, xml
        'FEED_URI': 'output.csv', #
        # download images to `IMAGES_STORE/full` (standard folder) and convert to JPG (even if it is already JPG)
        # it needs `yield {'image_urls': [url]}` in `parse()` and both ITEM_PIPELINES and IMAGES_STORE to work
        #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1}, # use standard ImagesPipeline (download to IMAGES_STORE/full)
        'ITEM_PIPELINES': {'__main__.RenameImagePipeline': 1}, # use the pipeline created in this file (needs `__main__`)
        #'IMAGES_STORE': '/full/path/to/valid/dir', # this folder has to exist before downloading
        'IMAGES_STORE': 'Master', # this folder has to exist before downloading
    })
    c.crawl(MySpider)
    c.start()
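After running it, you can walk the store folder to see what the pipeline produced; a small sketch (the sample file is created here only to demonstrate the name/album/image layout, so the snippet runs without a crawl):

```python
import os

def list_downloaded(store):
    '''Return the paths of all files under the store folder, relative to it.'''
    found = []
    for root, _dirs, files in os.walk(store):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), store))
    return sorted(found)

# simulate the `name/album/image` layout the pipeline creates
os.makedirs('Master/books/a', exist_ok=True)
open('Master/books/a/sample.jpg', 'w').close()
print(list_downloaded('Master'))
```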
You can find the source code and see how it looks in the original ImagesPipeline. In the full example above I added some of the original code in comments.

To find where scrapy is installed:

    import scrapy
    print(scrapy.__file__)

On Linux I have:

    /usr/local/lib/python3.7/dist-packages/scrapy/pipelines/images.py
    /usr/local/lib/python3.7/dist-packages/scrapy/pipelines/files.py

BTW: ImagesPipeline converts every downloaded image to JPG, even if it is already a JPG. If you want to keep the original files, you may need FilesPipeline instead of ImagesPipeline, and FILES_STORE instead of IMAGES_STORE.
BTW: pipelines can sometimes be a problem to debug, because scrapy catches their exceptions and does not display an error message, so it is hard to recognize when code in a pipeline is wrong.
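One workaround (my addition, not from the original answer) is to catch and log exceptions inside the pipeline method yourself, so mistakes at least show up in the output:

```python
import logging

logger = logging.getLogger(__name__)

def file_path(self, request, response=None, info=None):
    try:
        name = request.meta['name']
        album = request.meta['album']
        image = request.url.rsplit('/')[-1]
        return '%s/%s/%s' % (name, album, image)
    except Exception:
        # scrapy would normally swallow this - log it before re-raising
        logger.exception('file_path failed for %s', request.url)
        raise
```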
EDIT: The same example, but with FilesPipeline (and FILES_STORE and item['file_urls']). I marked the differences with the phrase "instead of".
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.pipelines.images import ImagesPipeline
    #from scrapy.commands.view import open_in_browser
    #import json

    class MySpider(scrapy.Spider):
        name = 'myspider'
        #allowed_domains = []
        # see page created for scraping: http://toscrape.com/
        start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

        def parse(self, response):
            print('url:', response.url)
            #open_in_browser(response) # to see url in web browser
            # download all types of files (without converting images to JPG)
            for url in response.css('img::attr(src)').extract():
                url = response.urljoin(url)
                image = url.rsplit('/')[-1] # get first char from image name
                #yield {'image_urls': [url], 'name': 'books', 'album': image[0]}
                yield {'file_urls': [url], 'name': 'books', 'album': image[0]}