Scrapy image pipeline: how to drop images by checksum?


I am using the Scrapy image pipeline to scrape some images, and I want to exclude images that match a certain hash from the import.

class MyImagesPipeline(ImagesPipeline):

images:

url "https://www.example.de…212-B726-757P-A20D-1.jpg"
path    "full/56de72acb6c1e12ffa8644c1bb96df4edf421438.jpg"
checksum    "e206446c40c22cfd5f94966c337b56cc"

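How can I make sure this image is excluded from the import?

You can try overriding the get_images method of the image pipeline. If the hash matches, the image is dropped instead of being stored.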

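    import logging
    from io import BytesIO

    from scrapy.pipelines.images import ImageException, ImagesPipeline
    from scrapy.utils.misc import md5sum

    logger = logging.getLogger(__name__)


    class MyImagesPipeline(ImagesPipeline):
        # MD5 hex digests of the images to reject (placeholders).
        drop_list = ['hash1', 'hash2']

        # **kwargs forwards the keyword-only "item" argument that newer
        # Scrapy releases pass to get_images; older releases pass nothing extra.
        def get_images(self, response, request, info, **kwargs):
            # MD5 of the raw response body.
            checksum = md5sum(BytesIO(response.body))
            logger.debug('Verifying checksum: %s', checksum)
            if checksum in self.drop_list:
                logger.debug('Skipping checksum: %s', checksum)
                # Raising here aborts processing of this image.
                raise ImageException('Dropping image with checksum %s' % checksum)
            return super().get_images(response, request, info, **kwargs)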
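
To fill the drop list, it can help to reproduce the checksum outside of Scrapy first. The following is a minimal standalone sketch (no Scrapy import) that assumes the checksum is a plain MD5 hex digest, which the 32-character value in the question suggests. The names md5_hex and should_drop are hypothetical helpers, not part of Scrapy.

```python
import hashlib
from io import BytesIO


def md5_hex(fileobj, chunk_size=8192):
    """Return the MD5 hex digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()


def should_drop(body, drop_list):
    """Return True if the MD5 of the raw bytes is one of the unwanted digests."""
    return md5_hex(BytesIO(body)) in drop_list


if __name__ == "__main__":
    print(md5_hex(BytesIO(b"hello")))  # 5d41402abc4b2a76b9719d911017c592
```

Running md5_hex over a locally saved copy of the dummy image (opened in binary mode) gives a value that can be compared with the checksum Scrapy reports before adding it to the drop list.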

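Thanks, Sachin. I tried to implement it, but it had no effect. The function has to go inside the class MyImagesPipeline(ImagesPipeline), right? I replaced your drop_list with self.pic_dummy_checksum, and in the spider: pic_dummy_checksum = ['671f9f34aa31bdf6bf2aeb78e16db060', … How can I log the drop?

Try raising an exception. I have edited my answer. Hope this helps.

After putting the code into the MyImagesPipeline class, there are no entries in the log file. Where did you get this information? I can't find any documentation at all on media_downloaded.

There is no documentation. I was referring to the Scrapy source code of the files pipeline. Is the correct log level set? Try printing manually and running it from the command line.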
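
One possible explanation for the "no effect" report, offered as an assumption to verify: depending on the Scrapy version, the checksum stored with the item may be computed over the re-encoded image buffer rather than over the raw response body, so the MD5 of response.body need not match it. An alternative is to filter on the reported checksum itself once the downloads have finished. The sketch below is plain Python and only models the results structure that item_completed receives, a list of (success, info) tuples where info carries url, path and checksum. The name keep_images and the sample values are hypothetical.

```python
def keep_images(results, drop_checksums):
    """Return the info dicts of successful downloads whose checksum is not unwanted."""
    drop = set(drop_checksums)
    return [
        info
        for ok, info in results
        if ok and info.get("checksum") not in drop
    ]


if __name__ == "__main__":
    # Made-up results in the shape of the item data shown in the question.
    sample = [
        (True, {"url": "https://example.invalid/a.jpg",
                "path": "full/a.jpg",
                "checksum": "e206446c40c22cfd5f94966c337b56cc"}),
        (True, {"url": "https://example.invalid/b.jpg",
                "path": "full/b.jpg",
                "checksum": "00000000000000000000000000000000"}),
    ]
    kept = keep_images(sample, ["e206446c40c22cfd5f94966c337b56cc"])
    print([info["path"] for info in kept])
```

In an item_completed(self, results, item, info) override of MyImagesPipeline, the return value could be assigned to the item's images field before returning the item. If the list lives on the spider as in the comment above, getattr(info.spider, 'pic_dummy_checksum', []) is one way to read it from the pipeline. Note that this only removes the entry from the item; the file already written to disk stays there unless it is deleted separately. To see logger.debug output at all, the LOG_LEVEL setting has to be DEBUG.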