Python Scrapy authenticated session not carried through redirects, losing authentication every other page


I'm trying to crawl a directory website (similar to the Yellow Pages) that requires a login to view the information in each individual listing, but I'm running into an issue where the spider seems to lose its authentication while crawling the listings.

There are also cases where a directory listing redirects to another domain with a similar structure, so it can still be crawled the same way. The only difference is that the request is redirected, authentication is checked on the original site, and the authentication is then passed on to the new site (I'm not entirely sure about this; I'm not really well versed in web authentication flows).

I've tried putting
CONCURRENT_REQUESTS = 1
into my
settings.py
, but it didn't make much difference, except that it occasionally managed to stay authenticated for two listings in a row.
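For reference, the relevant `settings.py` entries would look something like this (the setting names are Scrapy built-ins; the values are the ones described above):

```python
# settings.py -- relevant entries (Scrapy built-in setting names)
CONCURRENT_REQUESTS = 1  # serialize requests; barely changed the behaviour here
COOKIES_ENABLED = True   # the default; CookiesMiddleware manages the session
COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header exchanged
```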

Including or omitting
__RequestVerificationToken
doesn't seem to make any difference either.

I believe I followed Scrapy's documentation for logging in before crawling, but if there's something I may have missed, please help me pick it out:

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = START_URL

    def parse(self, response):
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract_first()
        yield scrapy.FormRequest.from_response(response,
                                        formid='login-form',
                                        clickdata={'type': 'submit'},
                                        formdata={
                                                '__RequestVerificationToken': csrf_token,
                                                'username':'user',
                                                'password': 'pass'},
                                            callback=self.parse_after_login)

    def parse_after_login(self, response):
        if b"Member Section" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            for url in PAGES:
                yield scrapy.Request(url, callback=self.parse_directory_page)
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_directory_page(self, response):
        print("DETECTED: " + str(response.css('#navright > div.link_login.entry > a::text').extract()) + " BUTTON IN DIRECTORY LISTING PAGE")
        
        # ... URL manipulation code ...

        for listing in response.css(LISTING_SELECTOR):
            yield scrapy.Request(listing, callback=self.parse_listing)

    def parse_listing(self, response):
        print("DETECTED: " + str(response.css('#navright > div.link_login.entry > a::text').extract()) + " BUTTON IN COMPANY PAGE")

        # ... scraping and data manipulation ...

Console log with
COOKIES_DEBUG = True

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/MemberSection> (referer: https://www.example.com/Account/Login)
[myspider] DEBUG: Successfully logged in. Lets start crawling!
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search>
Cookie: ASP.NET_SessionId=lgvbj33sbnlbsyp4hpe5jhxf; 

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search> (referer: https://www.example.com/MemberSection)  
DETECTED: ['SIGN OUT'] BUTTON IN DIRECTORY LISTING PAGE
[scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://example.com/directory/members/123640>
Set-Cookie: ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu; path=/; HttpOnly; SameSite=Lax

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/directory/members/123640> (referer: https://www.example.com/Directory?siteID=24&au=m&pageIndex=1&pageSize=100&searchby=CountryCode&country=NZ&city=&keyword=&orderby=CountryCity&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=38&networkIds=103&layout=v1&submitted=search)
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://example.com/directory/members/119259>
Cookie: ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu

DETECTED: ['SIGN IN'] BUTTON IN COMPANY PAGE
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?otherParams> from <GET https://example.com/directory/members/119259>
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?otherParams>
Cookie: ASP.NET_SessionId=c2e3avij54xfpn4a5ofl4xdt;sso_signin=user_id=user_id_code;ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu

[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.example.com/Account/SsoLoginResult?user_id=000001&login_token=b612f5378f494713&p_action=&p_cid=0&returnurl=https%3a%2f%2fexample.com%2fdirectory%2fmembers%2f119259> from <GET https://webservice.example.com/wcasso/SsoV1/CheckLoggedIn?random=202104200004375582&otherParams>
[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.example.com/Account/SsoLoginResult?Params>
Cookie: ASP.NET_SessionId=lgvbj33sbnlbsyp4hpe5jhxf; __RequestVerificationToken=TokenHere; ASP.NET_SessionId=iyeclhcbqwxbfycplifedyxu



*I removed/changed some information from the debug output because Stack Overflow seemed to think it was spam.

It looks like your crawler is hitting the "SIGN OUT" button, which invalidates the session ID server-side. You need to exclude the sign-out link from the crawl.

I don't think that's what is happening, because none of the pages I visit while logged in run any action that would log the session out. Also, the session ID from when I am "logged in" is sent out together with the initial session ID after the first login.
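For what it's worth, if the sign-out theory were correct, a simple guard would be to filter candidate URLs before yielding requests. The pattern below is a guess (this site's actual sign-out path isn't shown in the question), so adjust it to match the real endpoint:

```python
import re

# Hypothetical pattern; adjust to the site's real sign-out / log-off paths.
SIGN_OUT_RE = re.compile(r"/(sign-?out|log-?out|account/logoff)\b", re.I)

def crawlable(urls):
    """Drop any link that would hit a sign-out endpoint."""
    return [u for u in urls if not SIGN_OUT_RE.search(u)]
```

In the spider this would wrap the listing loop, e.g. `for listing in crawlable(response.css(LISTING_SELECTOR).getall()): ...`.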