Python Scrapy API – with a custom logger

I run Scrapy from a script via its API (Python 3.5, Scrapy 1.5). The main script calls a function that sets up its logging:
def main(target_year):
    project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0]
    iso_run_date = datetime.date.today().isoformat()
    logger = utils.get_logger(project, iso_run_date)
    scraping.run(project, iso_run_date, target_year)
Here is the function in the file "utils.py", along with a helper class for formatting; it creates the logger using Python's logging library:
class UTCFormatter(logging.Formatter):
    converter = time.gmtime

def get_logger(project, iso_run_date):
    ip_address_param = 'ip'

    logger = logging.getLogger(project)
    logger.setLevel(logging.DEBUG)

    file_handler = logging.FileHandler(os.path.abspath(os.path.join(
        'log', '{}_{}.log'.format(project, iso_run_date))))
    file_handler.setLevel(logging.DEBUG)

    formatter = UTCFormatter(
        fmt=('[%(asctime)s.%(msecs)03dZ] %({})s %(name)s %(levelname)s: '
             '%(message)s').format(ip_address_param),
        datefmt='%Y-%m-%dT%H:%M:%S')
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    logger = logging.LoggerAdapter(
        logger, {ip_address_param: socket.gethostbyname(socket.gethostname())})
    return logger
This is the file "__init__.py" in the Scrapy directory:
@twisted.internet.defer.inlineCallbacks
def crawl(crawler_process, project, iso_run_date, target_year):
    yield crawler_process.crawl(project, iso_run_date, target_year)

def run(project, iso_run_date, target_year):
    os.environ.setdefault(
        'SCRAPY_SETTINGS_MODULE', 'scraping.scraping.settings')
    crawler_process = scrapy.crawler.CrawlerProcess(
        scrapy.utils.project.get_project_settings())
    crawl(crawler_process, project, iso_run_date, target_year)
    crawler_process.start()
When I execute the script, the log records from the main script appear in the output log file, but I get nothing from Scrapy.
When I add this to the spider:
self.logger.debug('Test')
I get this error:
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 980, in emit
msg = self.format(record)
File "/usr/lib/python3.5/logging/__init__.py", line 830, in format
return fmt.format(record)
File "/usr/lib/python3.5/logging/__init__.py", line 570, in format
s = self.formatMessage(record)
File "/usr/lib/python3.5/logging/__init__.py", line 539, in formatMessage
return self._style.format(record)
File "/usr/lib/python3.5/logging/__init__.py", line 383, in format
return self._fmt % record.__dict__
KeyError: 'ip'
Call stack:
File "XXXXX.py", line 105, in <module>
main(target_year)
File "XXXXX.py", line 23, in main
scraping.run(project, iso_run_date, target_year)
File "/home/XYZ/virtualenvs/scraping/project/scraping/__init__.py", line 27, in run
crawler_process.start()
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/crawler.py", line 291, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 1261, in run
self.mainLoop()
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 1270, in mainLoop
self.runUntilCurrent()
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 896, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
request = next(slot.start_requests)
File "/home/XYZ/virtualenvs/scraping/project/scraping/scraping/spiders/XXXXX.py", line 47, in start_requests
self.logger.debug('Test')
Message: 'Test'
Arguments: ()
When I use basicConfig in my main script, everything works, and Scrapy seems to simply pick up this basic logger. But because of the additional formatting I need, I have to use the more advanced setup above.
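This behaviour matches how the stdlib logging hierarchy works: Scrapy's loggers (e.g. "scrapy.core.engine") propagate their records up to the root logger, so a handler installed on the root by basicConfig() sees them, while a handler attached only to a logger named after the project does not. A small sketch of that propagation (logger names here are illustrative, and the handler writes to a string buffer instead of a file):

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(name)s %(levelname)s: %(message)s'))

# Attaching the handler to the ROOT logger captures every propagating
# record, which is effectively what logging.basicConfig() does.
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)

# Both an application logger and an unrelated library-style logger
# (standing in for e.g. 'scrapy.core.engine') reach the same handler.
logging.getLogger('myscript').info('main script record')
logging.getLogger('scrapy.core.engine').info('library record')

print(stream.getvalue())
```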
I would like to define a custom logger in my main script, as in the code above, and have Scrapy write to the same output file with the same format, without redefining all of it again. Is that possible?

I found an approach that seems to work. The main script stays unchanged:
def main(target_year):
    project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0]
    iso_run_date = datetime.date.today().isoformat()
    logger = utils.get_logger(project, iso_run_date)
    scraping.run(project, iso_run_date, target_year)
The file "utils.py" now simply uses logging.basicConfig(), as I had done previously:
def get_logger(project, iso_run_date):
    logging.basicConfig(
        filename=os.path.abspath(os.path.join(
            'log', '{0}_{1}.log'.format(project, iso_run_date))),
        format=('[%(asctime)s.%(msecs)03dZ] {0} %(name)s %(levelname)s: '
                '%(message)s').format(
            socket.gethostbyname(socket.gethostname())),
        datefmt='%Y-%m-%dT%H:%M:%S',
        level=logging.DEBUG)
    logging.Formatter.converter = time.gmtime
    return logging.getLogger(project)
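If the '%(ip)s' placeholder from the original formatter is preferred over baking the address into the format string, another option (not the one I ended up using) is logging.setLogRecordFactory(), available since Python 3.2: it stamps the field onto every LogRecord process-wide, so records from any library format cleanly. A sketch, with a hypothetical address standing in for socket.gethostbyname(socket.gethostname()):

```python
import io
import logging

# Wrap the default record factory so every record gets an 'ip' attribute.
old_factory = logging.getLogRecordFactory()

def record_factory(*args, **kwargs):
    record = old_factory(*args, **kwargs)
    record.ip = '127.0.0.1'   # hypothetical; use the real host address here
    return record

logging.setLogRecordFactory(record_factory)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(ip)s %(name)s: %(message)s'))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)

# Even a logger that knows nothing about 'ip' now formats without a KeyError.
logging.getLogger('some.library').info('hello')
print(stream.getvalue())
```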
The file "__init__.py" is also unchanged:
@twisted.internet.defer.inlineCallbacks
def crawl(crawler_process, project, iso_run_date, target_year):
    yield crawler_process.crawl(project, iso_run_date, target_year)

def run(project, iso_run_date, target_year):
    os.environ.setdefault(
        'SCRAPY_SETTINGS_MODULE', 'scraping.scraping.settings')
    crawler_process = scrapy.crawler.CrawlerProcess(
        scrapy.utils.project.get_project_settings())
    crawl(crawler_process, project, iso_run_date, target_year)
    crawler_process.start()
Now all log messages go to the same log file in the same custom format. In addition:
- records from the main script use the variable "project" as their logger name
- records from the spider use the spider name as their logger name
- records from Scrapy's components use those components' names