
Python Scrapy API with a custom logger

I run Scrapy from a script via the API (Python 3.5, Scrapy 1.5).

The main script calls a function to handle its logging:

import datetime
import os

import scraping
import utils


def main(target_year):
    project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0]
    iso_run_date = datetime.date.today().isoformat()
    logger = utils.get_logger(project, iso_run_date)

    scraping.run(project, iso_run_date, target_year)
Here is the function in the file 'utils.py', along with an additional class for formatting, which creates the logger using Python's logging library:

import logging
import os
import socket
import time


class UTCFormatter(logging.Formatter):
    # Format timestamps in UTC instead of local time.
    converter = time.gmtime


def get_logger(project, iso_run_date):
    ip_address_param = 'ip'
    logger = logging.getLogger(project)
    logger.setLevel(logging.DEBUG)
    file_handler = logging.FileHandler(os.path.abspath(os.path.join(
        'log', '{}_{}.log'.format(project, iso_run_date))))
    file_handler.setLevel(logging.DEBUG)
    formatter = UTCFormatter(
        fmt=('[%(asctime)s.%(msecs)03dZ] %({})s %(name)s %(levelname)s: '
             '%(message)s').format(ip_address_param),
        datefmt='%Y-%m-%dT%H:%M:%S')
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)
    # Wrap the logger so every call made through the adapter
    # carries the host IP in the 'ip' field.
    logger = logging.LoggerAdapter(
        logger, {ip_address_param: socket.gethostbyname(socket.gethostname())})
    return logger
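
Note that LoggerAdapter merges its extra dict only into records created through the adapter itself; anything logged directly on the underlying logger reaches the handler without the 'ip' attribute. Here is a minimal standalone sketch of that behavior (the 'demo' logger name and the messages are made up):

import logging
import socket

demo_logger = logging.getLogger('demo')
demo_logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(ip)s %(message)s'))
demo_logger.addHandler(handler)

adapter = logging.LoggerAdapter(
    demo_logger, {'ip': socket.gethostbyname(socket.gethostname())})
adapter.debug('fine: the adapter injects the "ip" attribute')
demo_logger.debug('fails: no "ip" attribute, so formatting raises KeyError')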
This is the file '__init__.py' in the Scrapy directory:

import os

import scrapy.crawler
import scrapy.utils.project
import twisted.internet.defer


@twisted.internet.defer.inlineCallbacks
def crawl(crawler_process, project, iso_run_date, target_year):
    yield crawler_process.crawl(project, iso_run_date, target_year)


def run(project, iso_run_date, target_year):
    os.environ.setdefault(
        'SCRAPY_SETTINGS_MODULE', 'scraping.scraping.settings')
    crawler_process = scrapy.crawler.CrawlerProcess(
        scrapy.utils.project.get_project_settings())
    crawl(crawler_process, project, iso_run_date, target_year)
    crawler_process.start()  # starts the Twisted reactor; blocks until done
When I execute the script, I get the logs from the main script in the output log file, but nothing from Scrapy.

When I add this to the spider:

self.logger.debug('Test')
I get this error:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 980, in emit
    msg = self.format(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 830, in format
    return fmt.format(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 570, in format
    s = self.formatMessage(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 539, in formatMessage
    return self._style.format(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 383, in format
    return self._fmt % record.__dict__
KeyError: 'ip'
Call stack:
  File "XXXXX.py", line 105, in <module>
    main(target_year)
  File "XXXXX.py", line 23, in main
    scraping.run(project, iso_run_date, target_year)
  File "/home/XYZ/virtualenvs/scraping/project/scraping/__init__.py", line 27, in run
    crawler_process.start()
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/crawler.py", line 291, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 1261, in run
    self.mainLoop()
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 1270, in mainLoop
    self.runUntilCurrent()
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/twisted/internet/base.py", line 896, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/home/XYZ/virtualenvs/scraping/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/home/XYZ/virtualenvs/scraping/project/scraping/scraping/spiders/XXXXX.py", line 47, in start_requests
    self.logger.debug('Test')
Message: 'Test'
Arguments: ()
When I use basicConfig in my main script instead, everything works fine, and Scrapy seems to pick up this basic logger. But because of the additional formatting I need the more advanced logging code shown above.


I would like to define a custom logger in my main script, as in the code above, and have Scrapy use the same format and the same output file, without having to redefine all of it again. Is that possible?
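
The KeyError makes sense in hindsight: crawl() is called with project as the spider name, so the spider's records travel through the very logger the file handler is attached to, just without the LoggerAdapter that injects 'ip'. One conceivable way to keep the handler-based setup would be a logging.Filter attached to the handler, so that every record gets stamped; this is only a sketch, and the HostIPFilter class is hypothetical:

import logging
import socket


class HostIPFilter(logging.Filter):
    """Stamp the host IP onto every record that passes through the handler."""

    def __init__(self, ip_address):
        super().__init__()
        self.ip_address = ip_address

    def filter(self, record):
        record.ip = self.ip_address
        return True  # never drop a record, only annotate it


# Attached to the handler (instead of wrapping the logger in a
# LoggerAdapter), the filter covers records from every source,
# including the spider's logger:
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(ip)s %(name)s: %(message)s'))
handler.addFilter(HostIPFilter(socket.gethostbyname(socket.gethostname())))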

I found an approach that seems to work.

The main script stays the same:

def main(target_year):
    project = os.path.splitext(os.path.basename(os.path.abspath(__file__)))[0]
    iso_run_date = datetime.date.today().isoformat()
    logger = utils.get_logger(project, iso_run_date)

    scraping.run(project, iso_run_date, target_year)
The file 'utils.py' now simply uses logging.basicConfig(), as I did before:

import logging
import os
import socket
import time


def get_logger(project, iso_run_date):
    logging.basicConfig(
        filename=os.path.abspath(os.path.join('log', '{0}_{1}.log'.format(project, iso_run_date))),
        format='[%(asctime)s.%(msecs)03dZ] {0} %(name)s %(levelname)s: %(message)s'.format(socket.gethostbyname(socket.gethostname())),
        datefmt='%Y-%m-%dT%H:%M:%S',
        level=logging.DEBUG)
    logging.Formatter.converter = time.gmtime
    return logging.getLogger(project)
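
This presumably works because basicConfig() installs its handler on the root logger, and named loggers (Scrapy's included) propagate their records up to the root by default. A minimal sketch of the mechanism (file name and messages are made up):

import logging

# basicConfig() attaches a handler to the root logger.
logging.basicConfig(
    filename='demo.log',
    format='%(name)s %(levelname)s: %(message)s',
    level=logging.DEBUG)

# Records from any named logger propagate up to the root handler,
# so both lines below land in demo.log, each with its own %(name)s.
logging.getLogger('my_script').info('from the main script')
logging.getLogger('scrapy.core.engine').info('from a Scrapy component')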
The file '__init__.py' is also unchanged:

@twisted.internet.defer.inlineCallbacks
def crawl(crawler_process, project, iso_run_date, target_year):
    yield crawler_process.crawl(project, iso_run_date, target_year)


def run(project, iso_run_date, target_year):
    os.environ.setdefault(
        'SCRAPY_SETTINGS_MODULE', 'scraping.scraping.settings')
    crawler_process = scrapy.crawler.CrawlerProcess(
        scrapy.utils.project.get_project_settings())
    crawl(crawler_process, project, iso_run_date, target_year)
    crawler_process.start()
Now all log messages end up in the same log file, in the same custom format. Furthermore:

  • Logs from the main script use the variable 'project' as the logger name
  • Logs from the spiders use the spider name as the logger name
  • Logs from Scrapy components use the Scrapy component names