Python Scrapy: authenticated session is not carried through redirects and authentication is lost on every other page
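Tags: python, asp.net, cookies, scrapy

I am trying to crawl a directory website (similar to the Yellow Pages) that requires a login to view the information in each listing. The problem is that the spider seems to lose authentication on every other listing it crawls.

In some cases a directory listing also redirects to another domain with a similar structure, so it can still be crawled the same way. The only difference is that the request is redirected, authentication is checked on the original site, and the authentication is then passed on to the new site. (I'm not entirely sure about this; I'm not really well versed in web authentication.)

I have tried putting CONCURRENT_REQUESTS=1 in my settings.py, but it made little difference, except that it sometimes authenticates successfully for two listings in a row.

Including or omitting __RequestVerificationToken does not seem to make any difference either.

I believe I followed Scrapy's documentation for logging in before crawling, but if there is anything I have missed, please help me spot it.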
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = START_URL  # login page URL(s), defined elsewhere

    def parse(self, response):
        # Read the ASP.NET anti-forgery token from the login form.
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract_first()
        yield scrapy.FormRequest.from_response(
            response,
            formid='login-form',
            clickdata={'type': 'submit'},
            formdata={
                '__RequestVerificationToken': csrf_token,
                'username': 'user',
                'password': 'pass',
            },
            callback=self.parse_after_login,
        )

    def parse_after_login(self, response):
        if b"Member Section" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin.
            for url in PAGES:  # directory page URLs, defined elsewhere
                yield scrapy.Request(url, callback=self.parse_directory_page)
        else:
            # Something went wrong, we couldn't log in, so nothing happens.
            self.log("Bad times :(")

    def parse_directory_page(self, response):
        print("DETECTED: " + str(response.css('#navright > div.link_login.entry > a::text').extract()) + " BUTTON IN DIRECTORY LISTING PAGE")
        # ... URL manipulation code
        # Assumes LISTING_SELECTOR selects link URLs (for example, ends in ::attr(href)).
        for listing in response.css(LISTING_SELECTOR).extract():
            yield response.follow(listing, callback=self.parse_listing)

    def parse_listing(self, response):
        print("DETECTED: " + str(response.css('#navright > div.link_login.entry > a::text').extract()) + " BUTTON IN COMPANY PAGE")
        # ... scrape and data manipulation
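For reference, the two Scrapy settings mentioned in this question would look roughly like the following in settings.py. This is only a sketch of the relevant lines, assuming the rest of the project settings are left at their defaults.

```python
# settings.py (sketch: only the two settings discussed in this question)

# Process one request at a time, so requests cannot overlap each other.
CONCURRENT_REQUESTS = 1

# Log every Cookie header sent and every Set-Cookie header received.
COOKIES_DEBUG = True
```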
Console log with COOKIES_DEBUG=True:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/MemberSection> (referer: https://www.example.com/Account/Login)
[myspider] DEBUG: Successfully logged in. Lets start crawling!
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search>
Cookie: ASP.NET_SessionId=lgvbj33sbnlbsyp4hpe5jhxf;
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search> (referer: https://www.example.com/MemberSection)
DETECTED: ['SIGN OUT'] BUTTON IN DIRECTORY LISTING PAGE
[scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://example.com/directory/members/123640>
Set-Cookie: ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu; path=/; HttpOnly; SameSite=Lax
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/directory/members/123640> (referer: https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search)
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://example.com/directory/members/119259>
Cookie: ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu
DETECTED: ['SIGN IN'] BUTTON IN COMPANY PAGE
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?otherParams> from <GET https://example.com/directory/members/119259>
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?otherParams>
Cookie: ASP.NET_SessionId=c2e3avij54xfpn4a5ofl4xdt;sso_signin=user_id=user_id_code;ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.example.com/Account/SsoLoginResult?user_id=000001&login_token=b612f5378f494713&p_action=&p_cid=0&returnurl=https%3a%2f%2fexample.com%2fdirectory%2fmembers%2f119259> from <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?random=202104200004375582&otherParams>
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com/Account/SsoLoginResult?Params>
Cookie: ASP.NET_SessionId=lgvbj33sbnlbsyp4hpe5jhxf; __RequestVerificationToken=TokenHere; ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu
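Two of the Cookie headers in the log above carry more than one ASP.NET_SessionId value, which may be worth investigating. The helper below is not part of the original spider; it is a minimal sketch (the function name is hypothetical) that flags repeated cookie names in a Cookie header string copied from the COOKIES_DEBUG output.

```python
from collections import defaultdict


def duplicate_cookie_names(cookie_header):
    """Return {name: [values]} for every cookie name that appears more than once."""
    values = defaultdict(list)
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only, because cookie values may contain "=".
        name, _, value = pair.partition("=")
        values[name.strip()].append(value.strip())
    return {name: vals for name, vals in values.items() if len(vals) > 1}


# Cookie header taken from the log line for the CheckLoggedIn request.
header = (
    "ASP.NET_SessionId=c2e3avij54xfpn4a5ofl4xdt;"
    "sso_signin=user_id=user_id_code;"
    "ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu"
)
print(duplicate_cookie_names(header))
```

Running it on that header reports two different ASP.NET_SessionId values in the same request, while sso_signin appears only once.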
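*I removed or changed some information in the debug output because Stack Overflow seemed to think it was spam.

It looks like your crawler is following the "Sign out" link, which invalidates the session ID on the server side. You need to exclude "Sign out" from the crawl.

I don't think that's possible, because none of the other pages I was logged in on had the login check that signs you in again if you are signed out. The session ID from when I was "signed in" also matches the initial session ID from right after the first login.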
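To rule out the sign-out suggestion from the comment above, candidate URLs could be filtered before any request is yielded. The sketch below is an assumption-heavy illustration: the question never shows the site's actual sign-out URL, so the marker strings, the function names, and the sample LogOff URL are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical path fragments; the real sign-out URL is not shown in the question.
SIGN_OUT_MARKERS = ("logout", "logoff", "signout", "sign-out")


def is_sign_out_url(url):
    """Return True if the URL path looks like a sign-out endpoint."""
    path = urlparse(url).path.lower()
    return any(marker in path for marker in SIGN_OUT_MARKERS)


def drop_sign_out_urls(urls):
    """Keep only the URLs that do not look like sign-out endpoints."""
    return [url for url in urls if not is_sign_out_url(url)]


candidates = [
    "https://example.com/directory/members/123640",
    "https://www.example.com/Account/LogOff",  # hypothetical sign-out URL
]
print(drop_sign_out_urls(candidates))
```

In a spider, a check like is_sign_out_url could be applied to each extracted link just before the request for it is yielded.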