Python Scrapy can't parse URL after login
I have a problem with my script. I successfully log in to the website I need to crawl, but after logging in I stay stuck on the start_url / login page instead of parsing the URL "". I have seen many topics about this on Stack Overflow, but none of them had the same problem. Here is the code I am using:
import re
from scrapy.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from items import temaSpiderItem

class temaSpider(InitSpider):
    name = 'temaSpider'
    allowed_domains = ['http://tm-alumni.eu', 'http://www.tm-alumni.eu/#/annuaire/']
    start_urls = ['http://www.tm-alumni.eu/']
    login_page = 'http://www.tm-alumni.eu/'
    directory_page = 'http://www.tm-alumni.eu/#/annuaire/diplomes?user_type=1&filterGeo=1&activation_status=-1&view=tromb&page=1'

    def init_request(self):
        # InitSpider hook: fetch the login page before the normal crawl starts
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formxpath="//form[@id='loginform']",
                                         formdata={'username': 'email', 'password': 'password'},
                                         callback=self.check_login,
                                         dont_filter=True)

    def check_login(self, response):
        if "My Name" in response.body:
            self.log("=========Successfully logged in.=========")
            return Request(url=self.directory_page, callback=self.parse_directory, dont_filter=True)
        else:
            self.log("=========An error in login occurred.=========")

    def parse_directory(self, response):
        self.log("=========Data is flowing.=========")
        self.log(response.url)
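One detail worth flagging in check_login: in Scrapy, response.body is raw bytes. On Python 2 (which this spider runs on, per the urllib2 traceback below) the substring test works, but on Python 3 testing a str against bytes fails, so decoding first is safer. A minimal sketch of the safer check, using a stand-in string instead of a real response (the "My Name" marker is whatever text only appears when logged in):

```python
# -*- coding: utf-8 -*-
# Stand-in for response.body, which Scrapy always gives you as bytes.
body = u"<html>Welcome, My Name</html>".encode("utf-8")

# Decode before searching; on Scrapy >= 1.1 you can use response.text instead,
# which decodes with the response's declared encoding for you.
logged_in = u"My Name" in body.decode("utf-8")
```
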
Here is what I see in the console:
2016-01-05 11:48:49 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-05 11:48:49 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-05 11:48:49 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'Alumnis.json'}
2016-01-05 11:48:50 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-05 11:48:50 [boto] DEBUG: Retrieving credentials from metadata server.
2016-01-05 11:48:51 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\nmitchell\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-01-05 11:48:51 [boto] ERROR: Unable to read instance data, giving up
2016-01-05 11:48:51 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-05 11:48:51 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-05 11:48:51 [scrapy] INFO: Enabled item pipelines:
2016-01-05 11:48:51 [scrapy] INFO: Spider opened
2016-01-05 11:48:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-05 11:48:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-05 11:48:52 [scrapy] DEBUG: Crawled (200) <GET http://www.tm-alumni.eu/> (referer: None)
2016-01-05 11:48:53 [scrapy] DEBUG: Crawled (200) <POST http://www.tm-alumni.eu/authentication/index/login> (referer: http://www.tm-alumni.eu/)
2016-01-05 11:48:53 [temaSpider] DEBUG: =========Successfully logged in.=========
2016-01-05 11:48:55 [scrapy] DEBUG: Crawled (200) <GET http://www.tm-alumni.eu/#/annuaire/diplomes?user_type=1&filterGeo=1&activation_status=-1&view=tromb&page=1> (referer: http://www.tm-alumni.eu/authentication/index/login)
2016-01-05 11:48:55 [temaSpider] DEBUG: =========Data is flowing.=========
2016-01-05 11:48:55 [temaSpider] DEBUG: http://www.tm-alumni.eu/
Thanks in advance for your help.

Comment: Your parse_directory routine does get called, but it is not looking at the page your request queued. How do you know it is not the right page?

Reply: I know it is not the right page because response.url is "" rather than "", and the links I extract from the page are not the ones I am looking for. Did you manage to solve this?
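The console output above actually shows what is going wrong: directory_page contains a # fragment, and a URL fragment is never sent to the server, which is why the last log line reports response.url as http://www.tm-alumni.eu/ rather than the directory page. A quick standard-library sketch of what Scrapy's downloader effectively sees (urldefrag lives in urlparse on Python 2, which the spider runs on; urllib.parse on Python 3):

```python
from urllib.parse import urldefrag  # Python 2: from urlparse import urldefrag

directory_page = ('http://www.tm-alumni.eu/#/annuaire/diplomes'
                  '?user_type=1&filterGeo=1&activation_status=-1&view=tromb&page=1')

# Everything after '#' is a client-side route handled by the page's JavaScript,
# not part of the HTTP request.
base, fragment = urldefrag(directory_page)
print(base)      # -> http://www.tm-alumni.eu/  (the only URL the server ever sees)
print(fragment)  # -> /annuaire/diplomes?user_type=1&...  (never leaves the browser)
```

So the login code is fine; the directory is rendered client-side by JavaScript, which Scrapy does not execute. The spider would need either the underlying AJAX/JSON endpoint the page calls (visible in the browser's network tab), or a JavaScript-capable renderer in front of Scrapy.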