Python pseudo-authentication
I'm trying to use scrapy on a project, but I can't get past the site's authentication system. To understand the problem, I wrote a simple request handler:
import cookielib, urllib2

# Cookie-aware opener with a desktop Chrome User-Agent (Python 2)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'),]

url = 'https://text.westlaw.com/signon/default.wl?RS=ACCS10.10&VR=2.0&newdoor=true&sotype=mup'
r = opener.open(url)

# Save the response body for inspection
f = open('code.html', 'wb')
f.write(r.read())
f.close()
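For reference, here is the same snippet ported to Python 3, where cookielib and urllib2 were merged into http.cookiejar and urllib.request (a sketch; the URL and User-Agent are taken from the question, and the request is wrapped in a function so it only runs on demand):

```python
import http.cookiejar
import urllib.request

# Cookie-aware opener, mirroring the Python 2 snippet above
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/30.0.1599.101 Safari/537.36')]

def save_page(url, path='code.html'):
    """Fetch url through the cookie-aware opener and save the body."""
    with opener.open(url) as r, open(path, 'wb') as f:
        f.write(r.read())
```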
The returned HTML does not contain the form elements. Does anyone know how to convince the server that I'm not a fake browser, so that I can proceed with the authentication?

You can use InitSpider, which lets you do some initialization before the crawl, such as logging in with a custom handler:
import logging

from scrapy import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider


class CrawlpySpider(InitSpider):
    # ...
    allowed_domains = ['domain.tld']

    # Make sure to add the logout page to the denied list,
    # otherwise the crawler logs itself out again
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=allowed_domains,
                unique=True,
                deny=('logout.php',),
            ),
            callback='parse',
            follow=True,
        ),
    )

    def init_request(self):
        """This function is called before crawling starts: do a login."""
        return Request(url="http://domain.tld/login.php", callback=self.login)

    def login(self, response):
        """Generate a login request from the form on the login page."""
        return FormRequest.from_response(
            response,
            formdata={
                "username": "admin",
                "password": "very-secure",
                "required-field": "my-value",
            },
            method="post",
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we
        are successfully logged in.
        """
        if b"incorrect password" not in response.body:
            # Now the crawling can begin...
            logging.info('Login successful')
            return self.initialized()
        else:
            # Something went wrong; we couldn't log in, so nothing happens.
            logging.error('Unable to login')

    def parse(self, response):
        """Your stuff here"""
I also just implemented a working example that does exactly what you are trying to achieve. Have a look at it.

Could the problem come from r = opener.urlopen(url) instead of open?
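On that last comment: an opener built with build_opener only exposes .open(); the module-level urllib.request.urlopen() routes through a custom opener only after urllib.request.install_opener() (Python 3 names shown as a sketch; on Python 2 the same functions live in urllib2):

```python
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# After this call, the module-level urlopen() routes through `opener`,
# so opener.open(url) and urllib.request.urlopen(url) behave the same.
urllib.request.install_opener(opener)
```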