Python Scrapy - spider module def functions not being called
My intent is to invoke the start_requests method to log in to the website, and to scrape the site after logging in. From the log messages I can see that:

1. start_requests is not being called.
2. The parse callback is not being called either.

What actually happens is that the spider just loads the URLs in start_urls.

Questions:

Why doesn't the spider crawl the other pages (say pages 2, 3, 4)?
Why doesn't logging in from the spider work?

Note:

My way of computing the page numbers and building the URLs is correct. I verified it.
I referred to this link to write this code.

My code:

zauba.py (the spider)

loginform.py
#!/usr/bin/env python
import sys
from argparse import ArgumentParser
from collections import defaultdict
from lxml import html

__version__ = '1.0'  # also update setup.py


def _form_score(form):
    score = 0

    # In case of user/pass or user/pass/remember-me
    if len(form.inputs.keys()) in (2, 3):
        score += 10

    typecount = defaultdict(int)
    for x in form.inputs:
        type_ = x.type if isinstance(x, html.InputElement) else 'other'
        typecount[type_] += 1

    if typecount['text'] > 1:
        score += 10
    if not typecount['text']:
        score -= 10

    if typecount['password'] == 1:
        score += 10
    if not typecount['password']:
        score -= 10

    if typecount['checkbox'] > 1:
        score -= 10
    if typecount['radio']:
        score -= 10

    return score


def _pick_form(forms):
    """Return the form most likely to be a login form"""
    return sorted(forms, key=_form_score, reverse=True)[0]


def _pick_fields(form):
    """Return the most likely field names for username and password"""
    userfield = passfield = emailfield = None
    for x in form.inputs:
        if not isinstance(x, html.InputElement):
            continue

        type_ = x.type
        if type_ == 'password' and passfield is None:
            passfield = x.name
        elif type_ == 'text' and userfield is None:
            userfield = x.name
        elif type_ == 'email' and emailfield is None:
            emailfield = x.name

    return (userfield or emailfield, passfield)


def submit_value(form):
    """Returns the value for the submit input, if any"""
    for x in form.inputs:
        if x.type == 'submit' and x.name:
            return [(x.name, x.value)]
    return []


def fill_login_form(url, body, username, password):
    doc = html.document_fromstring(body, base_url=url)
    form = _pick_form(doc.xpath('//form'))
    (userfield, passfield) = _pick_fields(form)
    form.fields[userfield] = username
    form.fields[passfield] = password
    form_values = form.form_values() + submit_value(form)
    return (form_values, form.action or form.base_url, form.method)


def main():
    ap = ArgumentParser()
    ap.add_argument('-u', '--username', default='username')
    ap.add_argument('-p', '--password', default='secret')
    ap.add_argument('url')
    args = ap.parse_args()

    try:
        import requests
    except ImportError:
        print('requests library is required to use loginform as a tool')
        return 1

    r = requests.get(args.url)
    (values, action, method) = fill_login_form(args.url, r.text,
                                               args.username, args.password)

    print('url: {0}\nmethod: {1}\npayload:'.format(action, method))
    for (k, v) in values:
        print('- {0}: {1}'.format(k, v))


if __name__ == '__main__':
    sys.exit(main())
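To see concretely what fill_login_form does, here is a minimal sketch of the same lxml steps it performs, assuming lxml is installed. The HTML below is a hypothetical login page for illustration, not the real zauba.com markup:

```python
from lxml import html

# Hypothetical login page -- NOT the real zauba.com markup
body = """<html><body>
<form action="/user/login" method="post">
  <input type="text" name="name">
  <input type="password" name="pass">
  <input type="submit" name="op" value="Log in">
</form>
</body></html>"""

# The same steps fill_login_form performs: parse the page,
# fill the picked fields, then collect the form values.
doc = html.document_fromstring(body, base_url='https://example.com/p-1.html')
form = doc.xpath('//form')[0]
form.fields['name'] = 'mylogin'
form.fields['pass'] = 'mypass'

print(form.form_values())  # [('name', 'mylogin'), ('pass', 'mypass')]
print(form.method)         # POST
print(form.action)         # https://example.com/user/login
```

Note that lxml's form_values() excludes submit buttons, which is exactly why loginform appends submit_value(form) separately, and form.action is resolved against the base_url.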
Log messages:
2016-10-02 23:31:28 [scrapy] INFO: Scrapy 1.1.3 started (bot: scraptest)
2016-10-02 23:31:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scraptest.spiders', 'FEED_URI': 'medic.json', 'SPIDER_MODULES': ['scraptest.spiders'], 'BOT_NAME': 'scraptest', 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:39.0) Gecko/20100101 Firefox/39.0', 'FEED_FORMAT': 'json', 'AUTOTHROTTLE_ENABLED': True}
2016-10-02 23:31:28 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2016-10-02 23:31:28 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-02 23:31:28 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-02 23:31:28 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-02 23:31:28 [scrapy] INFO: Spider opened
2016-10-02 23:31:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-02 23:31:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-02 23:31:29 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/robots.txt> (referer: None)
2016-10-02 23:31:38 [scrapy] DEBUG: Crawled (200) <GET https://www.zauba.com/import-gold/p-1-hs-code.html> (referer: None)
2016-10-02 23:31:38 [scrapy] INFO: Closing spider (finished)
2016-10-02 23:31:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 558,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 136267,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 3, 6, 31, 38, 560012),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 10, 3, 6, 31, 28, 927872)}
2016-10-02 23:31:38 [scrapy] INFO: Spider closed (finished)
Scrapy already has a form request manager called FormRequest.

In most cases it will find the correct form by itself. You can try:
$ scrapy shell "https://www.zauba.com/import-gold/p-1-hs-code.html"
>>> from scrapy import FormRequest
>>> login_data = {'name': 'mylogin', 'pass': 'mypass'}
>>> request = FormRequest.from_response(response, formdata=login_data)
>>> print(request.body)
# b'form_build_id=form-Lf7bFJPTN57MZwoXykfyIV0q3wzZEQqtA5s6Ce-bl5Y&form_id=user_login_block&op=Log+in&pass=mypass&name=mylogin'
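The request body shown above is ordinary URL-encoded form data, so you can inspect such a payload with the standard library to confirm the credentials were filled in (the body string here is copied from the comment above):

```python
from urllib.parse import parse_qs

# Payload produced by FormRequest.from_response, as shown above
body = ('form_build_id=form-Lf7bFJPTN57MZwoXykfyIV0q3wzZEQqtA5s6Ce-bl5Y'
        '&form_id=user_login_block&op=Log+in&pass=mypass&name=mylogin')

# parse_qs maps each field name to a list of values; take the first
fields = {k: v[0] for k, v in parse_qs(body).items()}
print(fields['name'], fields['pass'])  # mylogin mypass
print(fields['op'])                    # Log in  ('+' decodes to a space)
```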
Once you are logged in, every request to a link after that carries the session cookie, so you only need to log in once at the start of the crawl.

I found the mistake I made: I had not put the functions inside the class. That is why things did not go as expected. Now that I have added one level of indentation to all the functions, everything has started working as expected.
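The fix described above can be illustrated without Scrapy at all: a def left at module level is never seen by the class, so the spider inherits the base start_requests, which only fetches start_urls. A minimal sketch, where BaseSpider is a simplified stand-in for scrapy.Spider rather than the real class:

```python
class BaseSpider:
    """Stand-in for scrapy.Spider: default start_requests yields start_urls."""
    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield url


class BrokenSpider(BaseSpider):
    start_urls = ['https://example.com/p-1.html']

# Defined at module level by mistake -- NOT a method of BrokenSpider,
# so BrokenSpider silently inherits the base implementation instead.
def start_requests(self):
    yield 'https://example.com/login'


class FixedSpider(BaseSpider):
    start_urls = ['https://example.com/p-1.html']

    def start_requests(self):  # properly indented inside the class
        yield 'https://example.com/login'


print(list(BrokenSpider().start_requests()))  # ['https://example.com/p-1.html']
print(list(FixedSpider().start_requests()))   # ['https://example.com/login']
```

This matches the observed log: the broken spider only ever requested the start URL, because its login logic was never part of the class.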
Thanks @user2989777 and @Granitosaurus for coming to debug. Which Scrapy version are you using? Also, please give me more details about the loginform file. I am using version 1.1. It looks like the chain of function calls (start_requests -> login callback -> crawl start -> parse -> page-number extraction -> entry extraction) is not happening here. What I observe from the log messages is that only the start_urls request is made. Is there a way to debug and see the chain of function calls? I found the mistake I made!!!!