Python 3.x: How do I carry a custom attribute from the spider middleware through to the parse function?

Tags: python-3.x, web-scraping, scrapy

I would like to create a job_name inside spider_opened(...) in the spider middleware and use it:

  • as a value yielded from the parse function into my scrapy_results table
  • when logging the scrapy stats into my scrapy_log table

I have achieved #2 this way:

    middleware.py

    import datetime

    def spider_opened(self, spider):
        spider.logger.info('************Spider opened: %s' % spider.name)
        ...
        self.job_timestamp = int(datetime.datetime.now().timestamp())
        self.job_name = spider.name + '_' + str(self.job_timestamp)


    def spider_closed(self, spider, reason):
        spider.logger.info('************Spider closed: %s, Job: %s, Reason: %s' % (spider.name, self.job_name, str(reason)))
        ...
        insert_log_statement = "insert into scrapy_logs \
                values('%s', %s, '%s', %s, %s, %s, %s, %s) " \
                    % (self.job_name, self.job_timestamp, reason, downloader_request_count, response_received_count, \
                        elapsed_time_seconds, item_scraped_count, item_dropped_count)
        try:
            self.cur.execute(insert_log_statement)
        except Exception as exc:
            print("ERROR!! Could not commit transaction to insert log: ", insert_log_statement, exc)

        self.connection.commit()
        self.cur.close()
        self.connection.close()
    
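As an aside, the string-formatted insert above will break as soon as any value contains a quote character. A parameterized query avoids that; here is a minimal sketch, assuming a DB-API 2.0 driver (sqlite3 placeholder style shown; e.g. psycopg2 uses %s instead of ?):

    # Sketch only: parameterized version of the insert in spider_closed.
    # sqlite3 placeholder style; psycopg2 would use %s instead of ?.
    insert_log_statement = "insert into scrapy_logs values(?, ?, ?, ?, ?, ?, ?, ?)"
    try:
        self.cur.execute(insert_log_statement, (
            self.job_name, self.job_timestamp, reason,
            downloader_request_count, response_received_count,
            elapsed_time_seconds, item_scraped_count, item_dropped_count,
        ))
        self.connection.commit()
    except Exception as exc:
        spider.logger.error('Could not insert log row: %s', exc)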
    
But for #1, I would like to yield the job_name from the parse function together with the scraped fields, like below:

    
    myspider.py

    
    class AmazonbotSpider(scrapy.Spider):
        ...
        # job_name = spider.job_name
    
        def parse(self, response):
            ...
            yield {
                   'text': text,
                   # 'job_name': self.job_name,
                   ... }    
    
How do I access the myscraper.middleware.MySpiderMiddleware.job_name field from inside the MySpider class?

    def spider_opened(self, spider):
        spider.logger.info('************Spider opened: %s' % spider.name)
        ...
        self.job_timestamp = int(datetime.datetime.now().timestamp())
        self.job_name = spider.name + '_' + str(self.job_timestamp)
    
This is the method in your middleware. Keep in mind that the middleware is a class of its own, not part of the spider, so self.job_name assigns the value to the middleware instance (self), not to the spider.

Since the spider parameter the method receives is the actual spider instance, you can assign to it directly, like this:

    spider.job_name = spider.name + '_' + str(self.job_timestamp)
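
Applied to the spider_opened method from the middleware, the direct assignment would look like this (the same code as in the question, only with the attributes attached to the spider):

    import datetime

    def spider_opened(self, spider):
        spider.logger.info('************Spider opened: %s' % spider.name)
        # Attach the values to the spider instance rather than to the
        # middleware, so parse() can read them as self.job_name /
        # self.job_timestamp.
        spider.job_timestamp = int(datetime.datetime.now().timestamp())
        spider.job_name = spider.name + '_' + str(spider.job_timestamp)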
    
That would work, but I find it annoying... My suggestion would be to assign the value in the spider itself, which you can do by connecting the spider_opened signal inside the spider:

    import datetime

    from scrapy import signals
    from scrapy import Spider

    class AmazonbotSpider(Spider):
        ...

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(AmazonbotSpider, cls).from_crawler(crawler, *args, **kwargs)
            # Route the spider_opened signal to the handler defined below.
            crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
            return spider

        def spider_opened(self, spider):
            self.logger.info('************Spider opened: %s' % spider.name)
            ...
            # Here self is the spider itself, so both attributes are
            # available later in parse() as self.job_timestamp / self.job_name.
            self.job_timestamp = int(datetime.datetime.now().timestamp())
            self.job_name = spider.name + '_' + str(self.job_timestamp)
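
With the attribute living on the spider, #1 follows directly: parse() can yield it next to the scraped fields. A sketch of the method inside the AmazonbotSpider class above, based on the parse method from the question (the CSS selector is only an illustrative placeholder):

        def parse(self, response):
            # Placeholder selector; use whatever extraction the real spider does.
            for text in response.css('span.text::text').getall():
                yield {
                    'text': text,
                    'job_name': self.job_name,  # set in spider_opened above
                }

The spider_opened signal fires before the first request is scheduled, so self.job_name is guaranteed to exist by the time parse() runs.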