How do I pass a custom attribute from a spider middleware to the parse function?
I want to create a job_name inside spider_opened(...) in my spider middleware and use it:
1. as one of the yielded values in the parse function, to be written to my scrapy_results table
2. when recording the Scrapy stats in my scrapy_log table
I achieved #2 as follows:
middleware.py
def spider_opened(self, spider):
    spider.logger.info('************Spider opened: %s' % spider.name)
    ...
    self.job_timestamp = int(datetime.datetime.now().timestamp())
    self.job_name = spider.name + '_' + str(self.job_timestamp)

def spider_closed(self, spider, reason):
    spider.logger.info('************Spider closed: %s, Job: %s, Reason: %s' % (spider.name, self.job_name, str(reason)))
    ...
    insert_log_statement = "insert into scrapy_logs \
        values('%s', %s, '%s', %s, %s, %s, %s, %s) " \
        % (self.job_name, self.job_timestamp, reason, downloader_request_count, response_received_count, \
        elapsed_time_seconds, item_scraped_count, item_dropped_count)
    try:
        self.cur.execute(insert_log_statement)
    except:
        print("ERROR!! Could not commit transaction to insert log: ", insert_log_statement)
    self.connection.commit()
    self.cur.close()
    self.connection.close()
But for #1, I want to yield job_name in the parse function along with the scraped fields, as shown below:
myspider.py
class AmazonbotSpider(scrapy.Spider):
    ...
    # job_name = spider.job_name
    def parse(self, response):
        ...
        yield {
            'text': text,
            # 'job_name': self.job_name,
            ... }
How can I access the myscraper.middleware.MySpiderMiddleware.job_name field inside the MySpider class?
def spider_opened(self, spider):
    spider.logger.info('************Spider opened: %s' % spider.name)
    ...
    self.job_timestamp = int(datetime.datetime.now().timestamp())
    self.job_name = spider.name + '_' + str(self.job_timestamp)
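This is a method on the middleware. Keep in mind that the middleware is a class of its own, not part of the spider, so self.job_name assigns the value to the middleware (self), not to the spider.
Because the spider argument the method receives is the spider instance, you can assign the value to it directly, like this: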
spider.job_name = spider.name + '_' + str(self.job_timestamp)
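The difference between the two assignments can be checked without running a crawl. The following is a minimal sketch that uses hypothetical stand-in classes (FakeSpider, FakeMiddleware, FixedMiddleware) instead of real Scrapy objects; only the attribute assignment mirrors the code above.

```python
import datetime


class FakeSpider:
    """Hypothetical stand-in for a scrapy.Spider instance."""
    name = "amazonbot"  # assumed spider name


class FakeMiddleware:
    """Hypothetical stand-in for the spider middleware in the question."""

    def spider_opened(self, spider):
        self.job_timestamp = int(datetime.datetime.now().timestamp())
        # Stored on the middleware object only; the spider never sees it.
        self.job_name = spider.name + "_" + str(self.job_timestamp)


class FixedMiddleware(FakeMiddleware):
    """Same handler, but it also copies the value onto the spider."""

    def spider_opened(self, spider):
        super().spider_opened(spider)
        spider.job_name = self.job_name


spider_a = FakeSpider()
FakeMiddleware().spider_opened(spider_a)
print(hasattr(spider_a, "job_name"))  # False: the attribute lives on the middleware

spider_b = FakeSpider()
FixedMiddleware().spider_opened(spider_b)
print(hasattr(spider_b, "job_name"))  # True: parse() could now read self.job_name
```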
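This will work, but I find it rather awkward. My suggestion is to assign this value in the spider itself; you can do it in the __init__ method or in a spider_opened handler, like this: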
import datetime

from scrapy import signals
from scrapy import Spider

class AmazonbotSpider(Spider):
    ...
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(AmazonbotSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Run spider.spider_opened when Scrapy sends the spider_opened signal.
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        self.logger.info('************Spider opened: %s' % spider.name)
        ...
        self.job_timestamp = int(datetime.datetime.now().timestamp())
        self.job_name = spider.name + '_' + str(self.job_timestamp)
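The __init__ alternative mentioned above is not shown in the answer, so here is a minimal sketch of what it might look like. The spider name "amazonbot" and the "url" item field are assumptions made for illustration, and the try/except fallback class exists only so that the sketch also runs where Scrapy is not installed.

```python
import datetime

try:
    from scrapy import Spider
except ImportError:
    # Fallback stand-in so the sketch runs without Scrapy installed;
    # it mimics only the constructor behaviour used below.
    class Spider:
        name = None

        def __init__(self, name=None, **kwargs):
            if name is not None:
                self.name = name
            self.__dict__.update(kwargs)


class AmazonbotSpider(Spider):
    name = "amazonbot"  # assumed name; the question does not show it

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Computed once per spider instance, before any request is sent.
        self.job_timestamp = int(datetime.datetime.now().timestamp())
        self.job_name = self.name + "_" + str(self.job_timestamp)

    def parse(self, response):
        # The real extraction logic is elided in the question; response.url
        # is only a placeholder field here.
        yield {
            "url": response.url,
            "job_name": self.job_name,
        }
```

With the value stored on the spider, the middleware's spider_closed handler could read spider.job_name from the spider argument it receives rather than keeping its own copy.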