Session cookies: how to log in with Scrapy and keep cookies across requests
I want to log in to a website with Scrapy and then call another URL. So far, I have installed Scrapy and written the following script:
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest

class LoginSpider2(BaseSpider):
    name = 'github_login'
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Submit the login form found on the page.
        return [FormRequest.from_response(response,
                                          formdata={'login': 'username', 'password': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            self.log("Login succeeded")
After running this script, I got the "Login succeeded" log message.
Then I added another URL, but it doesn't work.
To do that, I replaced:
start_urls = ['https://github.com/login']
with a different start URL. But I got the following error:
2013-06-11 22:23:40+0200 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 4, in <module>
execute()
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 131, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 76, in _run_print_help
func(*a, **kw)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
cmd.run(args, opts)
File "/Library/Python/2.7/site-packages/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/Library/Python/2.7/site-packages/scrapy/spidermanager.py", line 43, in create
raise KeyError("Spider not found: %s" % spider_name)
What am I doing wrong? I searched Stack Overflow but couldn't find the right answer. Thanks.

The error means that Scrapy could not find the spider. Did you create it inside the project's spiders folder? In any case, once you get it running you will hit a second problem:
The default callback for start_urls requests is self.parse, which will fail on the repo page (there is no login form there). The requests may also run in parallel, so when the crawler hits the private repo it gets an error. You should keep only the login URL in start_urls and return a new Request from the after_login method once the login has succeeded. Like this:
def after_login(self, response):
    ...
    else:
        return Request('https://github.com/MyCompany/MyPrivateRepo',
                       callback=self.parse_repo)
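For what it's worth, Scrapy's cookies middleware keeps the session cookie set by the login response and attaches it to this follow-up Request automatically. Conceptually it behaves like Python's standard http.cookiejar; here is a minimal, self-contained sketch of that mechanism (not Scrapy's actual implementation — the URL and cookie value are made up, and the FakeResponse class only exists to avoid a real network call):

```python
import http.cookiejar
import urllib.request
from email.message import Message

# Stand-in for an HTTP response carrying a Set-Cookie header. It exposes
# just what CookieJar.extract_cookies() needs: an info() method returning
# headers that support get_all().
class FakeResponse:
    def __init__(self, url, set_cookie):
        self.url = url
        self.headers = Message()
        self.headers["Set-Cookie"] = set_cookie

    def info(self):
        return self.headers

jar = http.cookiejar.CookieJar()

# 1. The "login" response sets a session cookie; the jar stores it.
login_req = urllib.request.Request("https://github.com/login")
login_resp = FakeResponse("https://github.com/login", "session=abc123; Path=/")
jar.extract_cookies(login_resp, login_req)

# 2. The next request to the same site gets the cookie attached automatically.
next_req = urllib.request.Request("https://github.com/MyCompany/MyPrivateRepo")
jar.add_cookie_header(next_req)
print(next_req.get_header("Cookie"))  # session=abc123
```

This is exactly why returning the new Request from after_login works: by then the session cookie is already in Scrapy's jar, so the private-repo request is made as a logged-in user.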
Is the spider's name attribute still set correctly? A missing or incorrectly set name usually causes errors like the one above.
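One quick way to check (assuming a standard Scrapy project layout) is to list the spiders Scrapy can actually discover; the name you pass to scrapy crawl must match the spider's name attribute exactly:

```
# Run from the project root (the directory containing scrapy.cfg):
scrapy list                # prints the names of all discoverable spiders
scrapy crawl github_login  # must match the spider's name attribute
```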