Scrapy pipeline process_item not working (another "not working" question)

python scrapy

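This is yet another "process_item not working" question. I have done my best to research it, but I still need to ask. My code, simplified, is below.

In short, I want to scrape some product details from a website, and I have to use Splash to be able to read some of the CSS. I have registered one item and two DB classes. My plan is to store the products in the products table and their image paths in another table.

However, the images end up being downloaded, but the item pipeline is never triggered.

From my pipeline I only get two lines of output:

pipeline inited:
end init

Although I get the images, I never see the output of

print("pipeline" + image_url)

First, the pipeline: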

pipeline.py
=============
import scrapy
from sqlalchemy.orm import sessionmaker
from scrapy.exceptions import DropItem
from itembot.database.models import Products, ProductsPhotos, db_connect, create_products_table
from scrapy.pipelines.images import ImagesPipeline

class ImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item["image_urls"]:
            print("pipeline" + image_url)
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x["path"] for ok, x in results if ok]
        print("imagepath" + image_paths)
        if not image_paths:
            raise DropItem("Item contains no images")
        item["image_paths"] = image_paths
        return item


class ItembotPipeline(object):

    def __init__(self):

        print("pipeline inited: " )
        engine = db_connect()
        create_products_table(engine)
        self.Session = sessionmaker(bind=engine)
        print("end init")

    def process_item(self, item, spider):
        print("pipeline Entered : ",item )

        print("pipeline Entered : item is products ",item )
        products = Products(**item)
        try:
            session = self.Session()
            print("pipeline adding : ",item )
            session.add(products)
            session.commit()
            print("pipeline commited : ",item )
            session.refresh(products)
            item[id] = products[id]
            yield item[id]
        except:
            session.rollback()
            raise
        finally:
            session.close()
        if(products[id] is not None):
            print("pipeline 2if: ",item )
            productsphotos = ProductsPhotos(**item)
            try:
                session = self.Session()
                session.add(productsphotos)
                session.commit()
                session.refresh(productsphotos)
            except:
                session.rollback()
                raise
            finally:
                session.close()
        return item

Next, the spider:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy import Request
    from w3lib.html import remove_tags
    import re
    from ..database.models import Products
    from itembot.items import ItembotItem
    from scrapy_splash import SplashRequest

    class FreeitemSpider(scrapy.Spider):
        name = "freeitem"

        start_urls = [
            "https://google.com.hk",
        ]

        def parse(self, response):
            yield SplashRequest(url=response.url, callback=self.parse_product, args={"wait": 0.5})

        def parse_product(self, response):
            products = response.css(" div.classified-body.listitem.classified-summary")

            c = 0
            item = []
            for product in products:
                item = ItembotItem()
                imageurl = {}
                fullurls = []
                item["title"] = product.css("h4.R a::text").extract_first()

                pc = product.css("div#gallery" + str(c) + " ul a::attr(href)").extract()
                for link in pc:
                    fullurls.append(response.urljoin(link))
                item["image_urls"] = fullurls
                url = product.css("a.button-tiny-short.R::attr(href)").extract_first()
                item["webURL"] = response.urljoin(url)
                c = c + 1

                yield [item]

Here is my item:

import scrapy
class ItembotItem(scrapy.Item):
    id = scrapy.Field(default="null")
    title = scrapy.Field(default="null")
    details = scrapy.Field(default="null")
    webURL = scrapy.Field(default="null")
    images = scrapy.Field(default="null")
    image_urls = scrapy.Field(default="null")


class ProductsPhotos(DeclarativeBase):
    __tablename__ = "products_photos"
    id = Column(Integer, primary_key=True)
    product_ID = Column(ForeignKey(Products.id),nullable=False)
    photo_path = Column(String(200))

    parent = relationship(Products, load_on_pending=True)

settings.py

ITEM_PIPELINES = {
    "itembot.pipelines.ItembotPipeline": 300,
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "./photo"

model.py

class Products(DeclarativeBase):
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    title = Column(String(300))

    webURL = Column(String(200))

    def __str__(self):
        return self.title

class ProductsPhotos(DeclarativeBase):
    __tablename__ = "products_photos"
    id = Column(Integer, primary_key=True)
    product_ID = Column(ForeignKey(Products.id),nullable=False)
    photo_path = Column(String(200))

    parent = relationship(Products, load_on_pending=True)

I found one big mistake that can explain your problem.

First:

class ImagesPipeline(ImagesPipeline)

Don't give your own class the same name as its parent class.

Better:

class MyImagesPipeline(ImagesPipeline)
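
To see why reusing the parent's name is confusing, here is a minimal sketch in plain Python. The first `ImagesPipeline` below is a hypothetical stand-in for Scrapy's class (an assumption made so the snippet runs without Scrapy installed), and `library_class` and `describe` are illustrative names only.

```python
# Hypothetical stand-in for scrapy.pipelines.images.ImagesPipeline,
# defined here only so the sketch runs without Scrapy installed.
class ImagesPipeline:
    def source(self):
        return "library"


library_class = ImagesPipeline  # keep a handle on the original class


# Reusing the name rebinds "ImagesPipeline" in this module to the subclass.
class ImagesPipeline(ImagesPipeline):
    def source(self):
        return "custom"


def describe():
    # The bare name now refers to the subclass; the original class is only
    # reachable through the saved handle.
    return ImagesPipeline().source(), library_class().source()
```

The subclass still works, but a reader of the module can no longer tell at a glance which class a given `ImagesPipeline` refers to. A distinct name such as `MyImagesPipeline` avoids that ambiguity.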

Now for your main mistake:

ITEM_PIPELINES = {
   ...
   "scrapy.pipelines.images.ImagesPipeline": 1,
}

You are using ImagesPipeline from scrapy.pipelines.images, not the ImagesPipeline (or MyImagesPipeline) from itembot.pipelines. That is why Scrapy downloads the images but never runs print("pipeline" + image_url).

It should be:
ITEM_PIPELINES = {
    ...
    "itembot.pipelines.ImagesPipeline": 1,
}

Or, if you use the name MyImagesPipeline:

ITEM_PIPELINES = {
    ...
    "itembot.pipelines.MyImagesPipeline": 1,
}
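
The setting matters because each key in ITEM_PIPELINES is a dotted import path, and the path alone decides which class gets instantiated. The sketch below is a simplified illustration of that kind of lookup, not Scrapy's actual implementation. `resolve` is a hypothetical helper, and two standard-library paths stand in for the two pipeline paths so the snippet runs anywhere.

```python
import datetime
import time
from importlib import import_module


def resolve(dotted_path):
    """Import "package.module.name" and return the object called name."""
    module_path, _, name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), name)


# Same final name, different modules, therefore different objects. In the same
# way, "scrapy.pipelines.images.ImagesPipeline" and
# "itembot.pipelines.ImagesPipeline" name two different classes.
from_datetime = resolve("datetime.time")
from_time = resolve("time.time")

print(from_datetime is datetime.time)  # True
print(from_time is time.time)          # True
print(from_datetime is from_time)      # False
```

So with the original setting, the stock pipeline is loaded and the overridden methods in the project's own module are never called.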

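Use the {} button to format code correctly. Right now the indentation is wrong, and it looks like yield [item] is outside the for product in products loop.

Thanks! I have just corrected that mistake here; it was not in my actual code.

Thanks... thank you very much. I am starting to get more error messages, but at least I know I am making progress. So my next problem is to figure out why the other pipeline is not firing...

Thanks. However, I get the printout in get_media_requests but not in item_completed. I basically used the provided code:

class FImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        print("in pipeline")
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_LocalPath'] = image_paths
        print("pipeline")
        yield(item)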
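
One possible explanation for the last comment, which the thread itself does not confirm: the item_completed posted there appears to end with a yield rather than a return. Any Python function that contains yield is a generator function, so calling it returns a generator object without running the body, and a print inside it would not fire. The sketch below shows the effect in plain Python. Both function names are hypothetical and no Scrapy code is involved.

```python
calls = []


def completed_with_yield(item):
    # Contains "yield", so calling this only creates a generator object.
    calls.append("yield version ran")
    yield item


def completed_with_return(item):
    # An ordinary function: the body runs as soon as it is called.
    calls.append("return version ran")
    return item


result_a = completed_with_yield({"title": "x"})   # body not executed yet
result_b = completed_with_return({"title": "x"})  # body executed immediately
```

If that is what is happening, replacing the final yield with return item would be the thing to try. The same reasoning would apply to the yield item[id] inside process_item in the question.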