Python web scraping: forms authentication problem

python, python-3.x, web-scraping, forms-authentication

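Below is some code I have been using to try to log in to the Cook's Illustrated website.

I start a session, grab the authenticity token and a hidden encoding field, and then pass the "name" and "value" of the email and password fields (found by inspecting the elements in Chrome). The form does not appear to contain any other elements; however, the POST does not log me in.

I noticed that all of the CSRF tokens end in "==", so I tried stripping those off. That did not work.

I also tried modifying the POST to use the form inputs' "id" attribute instead of "name" (just a shot in the dark, really; "name" seems like it should work, judging from other examples I have seen).

Any ideas would be appreciated.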

import requests, lxml.html
s = requests.session()

# go to the login page and get its text
login = s.get('https://www.cooksillustrated.com/sign_in')
login_html = lxml.html.fromstring(login.text)

# find the hidden fields names and values; store in a dictionary
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden_inputs}
print(form)

# I noticed that they all ended in two = signs, so I tried taking that off
# form['authenticity_token'] = form['authenticity_token'][:-2]

# this adds to the form payload the two named fields for user name and password
# found using the "inspect elements" on the login screen
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'

# this uses "id" instead of "name" from the input fields
#form['user_email'] = 'my_email'
#form['user_password'] = 'my_pw'

response = s.post('https://www.cooksillustrated.com/sign_in', data=form)
print(form)

# trying to see if it worked - but the response URL is login again instead of main page
# and it can't find my name
# responses are okay, but I think that just means it posted the form
print(response.url)
print('Christopher' in response.text)
print(response.status_code)
print(response.ok)

Well, the POST request URL should be https://www.cooksillustrated.com/sessions. If you capture all the traffic while logging in, you will find the actual POST request made to the server:

POST /sessions HTTP/1.1
Host: www.cooksillustrated.com
Connection: keep-alive
Content-Length: 179
Cache-Control: max-age=0
Origin: https://www.cooksillustrated.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.cooksillustrated.com/sign_in
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8

utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo
Note that the last line is the URL-encoded data of this request. It has 4 parameters: utf8, authenticity_token, user[email], and user[password].
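To double-check which fields the browser sent, the captured body can be decoded with the standard library. This is only a sketch: the variable names `captured_body` and `fields` are mine, and the string is the last line of the captured request above.

```python
from urllib.parse import parse_qsl

# Body of the captured POST request, copied from the capture above
# (split across several literals only to keep the lines short).
captured_body = (
    "utf8=%E2%9C%93"
    "&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM"
    "%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D"
    "&user%5Bemail%5D=demo"
    "&user%5Bpassword%5D=demodemo"
)

# parse_qsl splits on '&' and '=' and percent-decodes names and values.
fields = dict(parse_qsl(captured_body))

for name, value in fields.items():
    print(f"{name!r}: {value!r}")
```

The decoded names are the keys the `data=` dictionary needs; the percent-encoding itself is applied again by requests when the form is posted.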

So, in your case, form should include all of them:

form = {'user[email]': 'my_email',
        'user[password]': 'my_pw',
        'utf8': '✓',
        'authenticity_token': 'xxxxxx'  # pass the token as-is; keep the trailing '=='
}
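The token changes on every page load, so in practice the hidden fields are read from the login page rather than pasted in, as the question's code already does with lxml. Here is a stdlib-only sketch of the same idea, using `html.parser` instead of lxml so that it runs without third-party packages. The class name `HiddenInputCollector`, the helper `build_login_form`, and the sample markup are all made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements.

    Unlike the XPath in the question, this does not check that the
    input sits inside a <form>; it looks at every <input> on the page.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if input_type == "hidden" and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_login_form(login_html, email, password):
    """Return the POST payload: hidden fields plus the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_html)
    form = dict(collector.fields)
    form["user[email]"] = email
    form["user[password]"] = password
    return form


# Made-up markup standing in for the real sign-in page.
sample_html = """
<form action="/sessions" method="post">
  <input name="utf8" type="hidden" value="&#x2713;">
  <input name="authenticity_token" type="hidden" value="abc+def==">
  <input name="user[email]" type="email">
  <input name="user[password]" type="password">
</form>
"""

form = build_login_form(sample_html, "my_email", "my_pw")
print(form)
```

With a live session, `login_html` would be the text of the response to the GET on the sign-in page.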
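Also, you may want to add some headers so the request appears to come from Chrome (or whichever browser you prefer), because the default User-Agent of requests is python-requests/2.13.0, and some websites do not like traffic from "bots".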
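The header dictionary itself did not survive in this copy of the answer. As a minimal sketch, a `headers` dictionary could reuse values from the captured browser request above; which of these the server actually inspects is an assumption on my part.

```python
# Header values copied from the captured browser request above.
# Which of them the site really checks is an assumption; sending a
# browser-like User-Agent, Origin, and Referer is a common precaution.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"
    ),
    "Origin": "https://www.cooksillustrated.com",
    "Referer": "https://www.cooksillustrated.com/sign_in",
}
```

Headers such as Content-Length and Content-Type are best left out, since requests computes them from the `data=` payload.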

Now we are ready to make the POST request:

response = s.post('https://www.cooksillustrated.com/sessions', data=form, headers=headers)

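The "==" at the end of the CSRF token is Base64 padding; the token is a Base64 string.

Thanks. Does that mean it needs to be decoded or stripped? Or should it be passed "as is"?

CSRF stands for Cross-Site Request Forgery, a type of attack in which a malicious site, email, program, etc. causes a user's browser to perform an unwanted action. Tokens are one way to prevent this. The token needs to be passed as is.

Thanks! Changing the post call to the /sessions URL worked great. It did not require me to change the headers, but I will add them to avoid problems. For anyone wondering where Shane's header and form information came from: in Chrome, you can go to Inspect > Network > [select the form name on the left] > Headers. The "Filter" box does not seem to work on header text, but you can find the most recent action relatively easily.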
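To illustrate the point about "==", here is a small stdlib sketch. The 64-byte value is made up (a real token comes from the login page); that length was picked only because it is not a multiple of 3, so Base64 pads it with two "=" characters.

```python
import base64
import binascii
from urllib.parse import urlencode

# Illustrative 64-byte value, not a real token.
raw = bytes(range(64))
token = base64.b64encode(raw).decode("ascii")
print(len(token), token[-2:])

# Stripping the padding leaves a string that no longer decodes.
try:
    base64.b64decode(token[:-2])
    stripped_decodes = True
except binascii.Error:
    stripped_decodes = False
print("decodes without padding:", stripped_decodes)

# Form encoding escapes '=' as %3D (and '+' as %2B), which is why the
# captured request body ends the token with %3D%3D.
encoded = urlencode({"authenticity_token": token})
print(encoded[-6:])
```

requests applies the same form encoding to a `data=` dictionary, so the token goes into the dictionary unmodified and unescaped.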