Python: how to download into a dynamic folder (Python, Scrapy)


I am trying to change the default path full/hash.jpg to /hash.jpg, where the folder name comes from the item. I tried the following code:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))

        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

But I get:

Traceback (most recent call last):
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
    --- <exception caught here> ---
    File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
    File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
    exceptions.OSError: [Errno 2] No such file or directory

I don't know what is wrong. Is there another way to change the path? Thanks

The problem arises because the dst folder doesn't exist. A quick solution is:

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])


        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
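
A note on the check itself: os.rename returns None on success and signals failure by raising OSError, so "if not os.rename(...)" is true even when the move worked. The sketch below is only an illustration under assumptions: a temporary directory stands in for IMAGES_STORE and the file names are made up. It shows both behaviours, the missing-folder error from the question and the None return value.

```python
import os
import tempfile

# Hypothetical layout for illustration: a scratch directory stands in for IMAGES_STORE.
with tempfile.TemporaryDirectory() as store:
    os.makedirs(os.path.join(store, "full"))
    src = os.path.join(store, "full", "abc.jpg")
    with open(src, "wb") as fh:
        fh.write(b"fake image bytes")

    dst = os.path.join(store, "hola", "abc.jpg")

    # 1) The destination folder does not exist yet, so the rename fails.
    try:
        os.rename(src, dst)
        missing_dir_error = None
    except OSError as exc:
        missing_dir_error = exc

    # 2) After creating the folder, the rename succeeds and returns None.
    os.makedirs(os.path.dirname(dst))
    rename_result = os.rename(src, dst)
    moved = os.path.exists(dst) and not os.path.exists(src)

print(type(missing_dir_error).__name__, rename_result, moved)
```

Because the check is always true, DropItem would be raised right after the first successful move, which would stop the loop before any remaining images of the item are handled.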

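The solution given by @neelix is the best one, but when I tried it I got some strange results: some documents were moved, but not all of them. So I replaced:

if not os.rename(path, target_path):
    raise DropItem("Could not move image to target folder")

I imported the shutil library and used shutil.move instead, so my code became: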

def item_completed(self, results, item, info):

    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])


        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')

        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)

        # If path doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))

        shutil.move(path, target_path)

    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

I hope it works for you too :)
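
The move step above can also be pulled out into a small helper that is easy to try outside Scrapy. This is a minimal sketch, assuming Python 3 (for exist_ok); the function name move_into_folder and the demo file names are hypothetical and not part of the original answer.

```python
import os
import shutil
import tempfile


def move_into_folder(storage, relative_path, folder):
    """Move storage/relative_path to storage/folder/<basename>; return the new relative path."""
    source = os.path.join(storage, relative_path)
    target_dir = os.path.join(storage, folder)
    # exist_ok=True (Python 3.2+) avoids a race between the existence check and the creation.
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(relative_path))
    shutil.move(source, target)
    return os.path.relpath(target, storage)


# Small demonstration with a scratch directory standing in for IMAGES_STORE.
with tempfile.TemporaryDirectory() as storage:
    os.makedirs(os.path.join(storage, "full"))
    with open(os.path.join(storage, "full", "abc.jpg"), "wb") as fh:
        fh.write(b"fake image bytes")

    new_path = move_into_folder(storage, os.path.join("full", "abc.jpg"), "hola")
    target_exists = os.path.exists(os.path.join(storage, new_path))
    source_gone = not os.path.exists(os.path.join(storage, "full", "abc.jpg"))
```

Inside item_completed, the returned value could be written back to result['path'] so that the item's images field points at the new location rather than at full/.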

I created a pipeline that inherits from ImagesPipeline, overrode its file_path method, and used it in place of the standard ImagesPipeline:

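import hashlib
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


# YEAR is assumed to be defined elsewhere in the module (not shown in the original answer).
class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)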
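
The path logic in file_path can be checked on its own, without running a crawl. The following is a sketch under assumptions: sharded_image_path is a hypothetical helper name, and the year value "2018" is a placeholder for whatever YEAR holds in the answer. It reproduces the same layout, where the first two pairs of hex characters of the SHA-1 digest become sub-folders so that no single directory collects every file.

```python
import hashlib


def sharded_image_path(url, root="realty-sc", year="2018"):
    """Return root/year/ab/cd/<sha1>.jpg for the given URL."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%s/%s/%s/%s/%s.jpg" % (root, year, image_guid[:2], image_guid[2:4], image_guid)


path = sharded_image_path("http://example.com/a.jpg")
parts = path.split("/")
print(path)
```

For a subclass like this to take effect, it has to be listed in ITEM_PIPELINES in settings.py in place of the stock ImagesPipeline; otherwise images keep landing in full/.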

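To set the path of the images downloaded by a Scrapy spider dynamically before they are downloaded, rather than moving them afterwards, I created a custom pipeline that overrides the get_media_requests and file_path methods: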

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        return [Request(url, meta={'f1':item.get('field1'), 'f2':item.get('field2'), 'f3':item.get('field3'), 'f4':item.get('field4')}) for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                      'please use file_path(request, response=None, info=None) instead',
                      category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'], request.meta['f3'], request.meta['f4'], image_guid)

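This approach assumes you define a scrapy.Item in your spider and replace the placeholder names, e.g. "field1", with your own field names. Setting Request.meta in get_media_requests makes the item's field values available when the download directory is chosen for each item, as shown in the return statement of file_path. Scrapy creates the directories automatically if they do not exist.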

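The custom pipeline class definition is saved in my project's pipelines.py. The methods here are adapted directly from the default Scrapy pipeline images.py, which on my Mac is stored in ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and other methods can be copied from that file as needed.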
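
One thing to watch with the meta-based path above: item.get('field1') yields None when a field is unset, which would produce a folder literally named "None". The sketch below is a hypothetical helper (path_from_meta is not part of Scrapy or of the original answer) that builds the same kind of path from a meta-like dict, with a fallback folder name and with path separators stripped from the field values. Newer Scrapy releases (2.4 and later, to my knowledge) also pass the item itself to file_path as a keyword-only item argument, which may make the round trip through Request.meta unnecessary.

```python
import hashlib


def path_from_meta(url, meta, keys=("f1", "f2", "f3", "f4"), fallback="unknown"):
    """Build 'f1/f2/f3/f4/<sha1>.jpg' from request.meta-style values."""
    folders = []
    for key in keys:
        value = meta.get(key)
        # Unset item fields arrive as None; use a fallback folder name
        # instead of ending up with a directory literally called "None".
        text = str(value).strip() if value is not None else ""
        # Replace path separators so a field value cannot add extra directory levels.
        text = text.replace("/", "_").replace("\\", "_")
        folders.append(text or fallback)
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "/".join(folders + ["%s.jpg" % image_guid])


example = path_from_meta(
    "http://example.com/a.jpg",
    {"f1": "shoes", "f2": None, "f3": "red/blue", "f4": "2018"},
)
print(example)
```

In the pipeline, file_path could then return path_from_meta(request.url, request.meta) in place of the hand-written format string.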

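Is it possible to print the path variables and verify that they are valid paths? Could you also copy the complete error traceback? I guess os.rename() is the problem here?

I added the full traceback. I also printed the paths: full/bc404f7f5e2ef9732d96d349f87cc66fa9f4479f.jpg and hola/bc404f7f5e2ef9732d96d349f87cc66fa9f4479f.jpg. I agree with you, I think the problem is with os.rename(). Shouldn't the paths be absolute? Thanks

Are you on Windows by chance? The os.rename documentation says that on Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file. So maybe the file already exists and you are getting a misleading error?

No, I'm on Linux. You mentioned a good point, but what if the dst folder doesn't exist? Will os.rename create it? I should check that.

I have researched and understand the expressions "for result in []" and "x in results", but I have failed to find any documentation that helps me understand the meaning of "[x for ok, x in results if ok]:". I have searched for "for statement in python" and "list if var in var". Can someone point me in the right direction?

You should search for "list comprehension" and read the Python documentation. I hope that resolves your problem; let me know if it is still unclear.

I still get the error [Errno 2] No such file or directory. I'm writing a new question that includes my code.

I don't understand why they were left out; including the imports with Python code is always a good thing.

Why not change the image path/name in def file_path instead of in def item_completed? By the time item_completed runs the images have already been downloaded, whereas def file_path is called before the image is downloaded.

Please show us all the code: item.py, spider.py and all the important files. I'm new to Scrapy and this code doesn't work for me.

Are you using Scrapy with Django? Maybe you forgot to add StoreImgPipeline to the pipelines in place of the original ImagesPipeline in the spider config.

No! I'm simply using scrapy startproject, like a basic scraper, but I want to sort all the images into different folders. Everything works, but the images still end up in the /full/ folder.

What is this "realty-sc"? It's a directory I should create, right?

"realty-sc" is just the folder where I want to store the images. Scrapy should create it automatically. The file_path function only returns the path under which the scraped file is saved, and you can set it however you like.

Thanks. It works when using mapping, but not otherwise.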
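
On the list comprehension asked about in the comments: the results argument of item_completed is a list of two-element tuples, a success flag followed by either the file info or the failure. The sketch below uses made-up data in place of real Scrapy results and shows that "[x for ok, x in results if ok]" keeps the second element of every tuple whose flag is true, next to the equivalent explicit loop.

```python
# Made-up stand-in for the `results` argument of item_completed:
# a list of (success_flag, info_or_failure) tuples.
results = [
    (True, {"path": "full/aaa.jpg"}),
    (False, Exception("download failed")),
    (True, {"path": "full/bbb.jpg"}),
]

# List comprehension: unpack each tuple into (ok, x) and keep x only when ok is true.
successful = [x for ok, x in results if ok]

# The same thing written as an explicit loop.
successful_loop = []
for ok, x in results:
    if ok:
        successful_loop.append(x)

paths = [info["path"] for info in successful]
print(paths)
```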