Python scrapy - spider module def functions not getting called

python, authentication, web-scraping, scrapy, scrapy-spider

My intention is to call the start_requests method to log in to the website, and to scrape the site after logging in. But from the log messages I can see that: 1. start_requests is not being called, and 2. the parse callback is not being called either.

What actually happens is that the spider just loads the URLs in start_urls.
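For context: when a spider class does not define start_requests itself, Scrapy falls back to the base Spider implementation, which simply turns start_urls into requests with parse as the default callback. Roughly (a paraphrase of Scrapy 1.1 behaviour, not the code from my spider):

    # Approximately what scrapy.Spider does when start_requests is not
    # overridden (paraphrased from Scrapy 1.1):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)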

Questions:

  • Why isn't the spider crawling the other pages (say pages 2, 3, and 4)?
  • Why isn't the scraping from the spider working?

Notes:

  • My approach for computing the page numbers and building the URLs is correct. I verified it.
  • I referred to this link to write this code.

My code:

    zauba.py (the spider)

    loginform.py

    #!/usr/bin/env python
    
    import sys
    from argparse import ArgumentParser
    from collections import defaultdict
    from lxml import html
    
    __version__ = '1.0'  # also update setup.py
    
    
    def _form_score(form):
        score = 0
    
        # In case of user/pass or user/pass/remember-me
    
        if len(form.inputs.keys()) in (2, 3):
            score += 10
    
        typecount = defaultdict(int)
        for x in form.inputs:
            type_ = x.type if isinstance(x, html.InputElement) else 'other'
            typecount[type_] += 1
    
        if typecount['text'] > 1:
            score += 10
        if not typecount['text']:
            score -= 10
    
        if typecount['password'] == 1:
            score += 10
        if not typecount['password']:
            score -= 10
    
        if typecount['checkbox'] > 1:
            score -= 10
        if typecount['radio']:
            score -= 10
    
        return score
    
    
    def _pick_form(forms):
        """Return the form most likely to be a login form"""
    
        return sorted(forms, key=_form_score, reverse=True)[0]
    
    
    def _pick_fields(form):
        """Return the most likely field names for username and password"""
    
        userfield = passfield = emailfield = None
        for x in form.inputs:
            if not isinstance(x, html.InputElement):
                continue
    
            type_ = x.type
            if type_ == 'password' and passfield is None:
                passfield = x.name
            elif type_ == 'text' and userfield is None:
                userfield = x.name
            elif type_ == 'email' and emailfield is None:
                emailfield = x.name
    
        return (userfield or emailfield, passfield)
    
    
    def submit_value(form):
        """Returns the value for the submit input, if any"""
    
        for x in form.inputs:
            if x.type == 'submit' and x.name:
                return [(x.name, x.value)]
        return []
    
    
    def fill_login_form(url, body, username, password):
        doc = html.document_fromstring(body, base_url=url)
        form = _pick_form(doc.xpath('//form'))
        (userfield, passfield) = _pick_fields(form)
        form.fields[userfield] = username
        form.fields[passfield] = password
        form_values = form.form_values() + submit_value(form)
        return (form_values, form.action or form.base_url, form.method)
    
    
    def main():
        ap = ArgumentParser()
        ap.add_argument('-u', '--username', default='username')
        ap.add_argument('-p', '--password', default='secret')
        ap.add_argument('url')
        args = ap.parse_args()
    
        try:
            import requests
        except ImportError:
            print 'requests library is required to use loginform as a tool'
            return 1  # bail out; otherwise requests.get below raises NameError
    
        r = requests.get(args.url)
        (values, action, method) = fill_login_form(args.url, r.text,
                args.username, args.password)
        print '''url: {0}
    method: {1}
    payload:'''.format(action, method)
        for (k, v) in values:
            print '- {0}: {1}'.format(k, v)
    
    
    if __name__ == '__main__':
        sys.exit(main())
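
For what it's worth, fill_login_form can be exercised on its own. A minimal sanity check might look like this (the HTML snippet and credentials are invented for illustration):

    # Hypothetical smoke test, assuming: from loginform import fill_login_form
    # The form HTML below is made up; it is not the zauba.com login page.
    PAGE = """
    <html><body>
      <form action="/user/login" method="post">
        <input type="text" name="name">
        <input type="password" name="pass">
        <input type="submit" name="op" value="Log in">
      </form>
    </body></html>
    """

    values, action, method = fill_login_form(
        'https://example.com/login', PAGE, 'mylogin', 'mypass')
    print(values)   # [('name', 'mylogin'), ('pass', 'mypass'), ('op', 'Log in')]
    print(action)   # https://example.com/user/login
    print(method)   # POST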
    
    Log messages:

    2016-10-02 23:31:28 [scrapy] INFO: Scrapy 1.1.3 started (bot: scraptest)
    2016-10-02 23:31:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraptest.spiders', 'FEED_URI': 'medic.json', 'SPIDER_MODULES': ['scraptest.spiders'], 'BOT_NAME': 'scraptest', 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:39.0) Gecko/20100101 Firefox/39.0', 'FEED_FORMAT': 'json', 'AUTOTHROTTLE_ENABLED': True}
    2016-10-02 23:31:28 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.feedexport.FeedExporter',
     'scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.throttle.AutoThrottle']
    2016-10-02 23:31:28 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-10-02 23:31:28 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-10-02 23:31:28 [scrapy] INFO: Enabled item pipelines:
    []
    2016-10-02 23:31:28 [scrapy] INFO: Spider opened
    2016-10-02 23:31:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-10-02 23:31:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2016-10-02 23:31:29 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/robots.txt> (referer: None)
    2016-10-02 23:31:38 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/import-gold/p-1-hs-code.html> (referer: None)
    2016-10-02 23:31:38 [scrapy] INFO: Closing spider (finished)
    2016-10-02 23:31:38 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 558,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 136267,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 10, 3, 6, 31, 38, 560012),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2016, 10, 3, 6, 31, 28, 927872)}
    2016-10-02 23:31:38 [scrapy] INFO: Spider closed (finished)
    
    
    Scrapy already has a form request manager called FormRequest.

    In most cases it will find the correct form on its own. You can try:

    >>> scrapy shell "https://www.zauba.com/import-gold/p-1-hs-code.html"
    from scrapy import FormRequest
    login_data = {'name': 'mylogin', 'pass': 'mypass'}
    request = FormRequest.from_response(response, formdata=login_data)
    print(request.body)
    # b'form_build_id=form-Lf7bFJPTN57MZwoXykfyIV0q3wzZEQqtA5s6Ce-bl5Y&form_id=user_login_block&op=Log+in&pass=mypass&name=mylogin'
    

    Once you are logged in, every subsequent request carries the session cookie, so you only need to log in once at the start of the crawl.
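
    Put together, a sketch of that flow might look like the following (the spider name, callback names, and page URLs are assumptions for illustration; the form field names come from the request body printed above):

        import scrapy
        from scrapy import FormRequest

        class LoginSpider(scrapy.Spider):
            # Hypothetical spider, only to illustrate the log-in-once flow
            name = 'login_example'
            start_urls = ['https://www.zauba.com/import-gold/p-1-hs-code.html']

            def parse(self, response):
                # Submit the login form found on the first page; the session
                # cookie from the response is reused by CookiesMiddleware
                yield FormRequest.from_response(
                    response,
                    formdata={'name': 'mylogin', 'pass': 'mypass'},
                    callback=self.after_login,
                )

            def after_login(self, response):
                # Authenticated from here on; crawl the remaining pages
                for page in range(2, 5):
                    url = 'https://www.zauba.com/import-gold/p-%d-hs-code.html' % page
                    yield scrapy.Request(url, callback=self.parse_page)

            def parse_page(self, response):
                pass  # extract the items of interest here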

    I found the mistake I made.

    I had not put the functions inside the class. That is why things did not go as expected. Once I indented all of those functions one level so that they became methods of the spider class, everything started working.
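
    In other words, the def blocks were sitting at module level, so the spider class never picked them up and Scrapy fell back to its built-in defaults. Schematically (the class and method names here follow the call chain mentioned in the comments below):

        import scrapy

        # Broken: module-level functions; scrapy.Spider's defaults run instead
        # def start_requests(self): ...
        # def parse(self, response): ...

        class ZaubaSpider(scrapy.Spider):
            name = 'zauba'
            start_urls = ['https://www.zauba.com/import-gold/p-1-hs-code.html']

            # Correct: indented one level, overriding the base class methods
            def start_requests(self):
                yield scrapy.Request(self.start_urls[0],
                                     callback=self.parse_login)

            def parse_login(self, response):
                pass  # submit the login form, then continue crawling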


    Thanks to @user2989777 and @Granitosaurus for stepping in to debug.

    Comments:

      • Which Scrapy version are you using? Also, please share more details about the loginform file.
      • I am using version 1.1. It looks like the function call chain (start_requests -> parse_login -> start_crawl -> parse -> get_page_count -> extract_entries) is not happening here. What I observe from the log messages is that only the start_urls are being loaded. Is there any way to debug and see the function call chain?
      • I found the mistake I made!!!!
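    As for tracing the call chain asked about above: one simple option is to log from each callback via the spider's built-in logger, e.g.:

        def parse_login(self, response):
            # self.logger is available on every Scrapy spider
            self.logger.info('parse_login called for %s', response.url)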