Python Data Scraping - Forms Authentication Problem

python python-3.x web-scraping

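Below is some code I have been trying to use to log in to the Cook's Illustrated website.

I start a session, get the authenticity token and a hidden encoding field, then pass the "name" and "value" of the email and password fields (found by inspecting the elements in Chrome). The form does not seem to contain any other elements; however, the POST does not log me in.

I noticed that all the CSRF tokens end in "==", so I tried removing those characters. That did not work.

I also tried changing the POST to use the "id" attribute of the form inputs instead of "name" (just a shot in the dark, really; "name" seems like it should work, judging from other examples I have seen).

Any ideas would be greatly appreciated.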

import requests, lxml.html
s = requests.session()

# go to the login page and get its text
login = s.get('https://www.cooksillustrated.com/sign_in')
login_html = lxml.html.fromstring(login.text)

# find the hidden fields names and values; store in a dictionary
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden_inputs}
print(form)

# I noticed that they all ended in two = signs, so I tried taking that off
# form['authenticity_token'] = form['authenticity_token'][:-2]

# this adds to the form payload the two named fields for user name and password
# found using the "inspect elements" on the login screen
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'

# this uses "id" instead of "name" from the input fields
#form['user_email'] = 'my_email'
#form['user_password'] = 'my_pw'

response = s.post('https://www.cooksillustrated.com/sign_in', data=form)
print(form)

# trying to see if it worked - but the response URL is login again instead of main page
# and it can't find my name
# responses are okay, but I think that just means it posted the form
print(response.url)
print('Christopher' in response.text)
print(response.status_code)
print(response.ok)

Well, the POST request URL should be https://www.cooksillustrated.com/sessions. If you capture all the traffic while logging in, you will find the actual POST request made to the server:

POST /sessions HTTP/1.1
Host: www.cooksillustrated.com
Connection: keep-alive
Content-Length: 179
Cache-Control: max-age=0
Origin: https://www.cooksillustrated.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.cooksillustrated.com/sign_in
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8

utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo

Note that the last line is the encoded data for this request. It has 4 parameters: utf8, authenticity_token, user[email] and user[password].

So in your case, form should include all of them:

form = {'user[email]': 'my_email',
        'user[password]': 'my_pw',
        'utf8': '✓',  # sent as %E2%9C%93 in the captured request
        'authenticity_token': 'xxxxxx'  # pass the token as is, including the trailing '=='
}
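To double-check the field names, the encoded body from the captured request can be decoded with the standard library. This is only a sketch for inspecting the capture above; the variable names `body` and `fields` are illustrative.

```python
from urllib.parse import parse_qsl

# Encoded body copied from the last line of the captured POST request.
body = (
    "utf8=%E2%9C%93"
    "&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM"
    "%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D"
    "&user%5Bemail%5D=demo"
    "&user%5Bpassword%5D=demodemo"
)

# parse_qsl reverses the URL encoding and keeps the field order.
fields = dict(parse_qsl(body))

for name, value in fields.items():
    print(name, "=", value)
```

The decoded token still ends in "==", which is how it has to be sent.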

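In addition, you may want to add some headers so the request appears to come from Chrome (or whichever browser you prefer), because the default User-Agent of requests is python-requests/2.13.0, and some websites don't like traffic from "bots".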
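A minimal sketch of such a headers dict, reusing the User-Agent and Referer values from the captured request above. Which headers this site actually checks, if any, is an assumption; these two are simply the most common ones to set.

```python
# Header values copied from the captured browser request; the choice of
# which headers to send is an assumption, not something the site documents.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'),
    'Referer': 'https://www.cooksillustrated.com/sign_in',
}
```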

Now we are ready to make the POST request:

response = s.post('https://www.cooksillustrated.com/sessions', data=form, headers=headers)
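Putting the steps together, here is a rough sketch, assuming the hidden inputs on /sign_in are the only extra fields the server expects. `HiddenInputCollector`, `build_login_form` and `sign_in` are illustrative names, not part of requests. The standard library's html.parser stands in for lxml so the parsing part runs without third-party packages; unlike the `//form//input` XPath in the question, it collects hidden inputs anywhere in the page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            # Attribute values arrive already unescaped (e.g. &#x2713; -> ✓).
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_form(login_html, email, password):
    """Return the POST payload: hidden fields plus the two credential fields."""
    collector = HiddenInputCollector()
    collector.feed(login_html)
    form = dict(collector.fields)
    form["user[email]"] = email
    form["user[password]"] = password
    return form


def sign_in(session, email, password, headers=None,
            base="https://www.cooksillustrated.com"):
    """Fetch the login page, then POST to /sessions (not /sign_in).

    `session` is expected to be a requests.Session so that cookies set by
    the GET are sent back with the POST.
    """
    login_page = session.get(base + "/sign_in", headers=headers)
    form = build_login_form(login_page.text, email, password)
    return session.post(base + "/sessions", data=form, headers=headers)


# Usage (needs the requests package and real credentials):
#   import requests
#   response = sign_in(requests.Session(), 'my_email', 'my_pw')
#   print(response.url)
```

The token is copied from the page untouched, so the trailing "==" is preserved, and requests takes care of URL-encoding it.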

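The "==" at the end of the CSRF token is there because the token is a Base64 string.

Thanks. Does that mean it needs to be decoded or stripped? Or should it be passed "as is"?

CSRF stands for Cross-Site Request Forgery, a type of attack in which a malicious site, email, program, etc. causes a user's browser to perform an unwanted action. Tokens are one way to prevent that. It needs to be passed as is.

Thanks! Changing the post call to the /sessions URL worked great. It didn't require me to change the headers, but I'll add them to avoid problems.

For anyone wondering where Shane's header and form information came from: in Chrome, you can go to Inspect > Network > [select the form name on the left] > Headers. The "Filter" box doesn't seem to work on header text, but you can find the most recent action relatively easily.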
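To illustrate the "pass it as is" point, here is a small sketch using the token from the captured request above. It shows that the "==" is Base64 padding (the string no longer decodes once it is stripped) and that URL-encoding turns it into the %3D%3D seen in the capture. The variable names are illustrative.

```python
import base64
import binascii
from urllib.parse import urlencode

# Token from the captured request, after URL-decoding.
token = ('Uvku64N8V2dq8z+GerrqWNobn03Ydjvz8xqgOAvfBmvDM+'
         '71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g==')

# With the padding in place the token is valid Base64.
raw = base64.b64decode(token)
print(len(raw), 'bytes decoded')

# Without the trailing "==" the length is no longer a multiple of 4.
try:
    base64.b64decode(token[:-2])
    stripped_ok = True
except binascii.Error:
    stripped_ok = False
print('decodes without padding:', stripped_ok)

# Form encoding escapes "+" and "=", so no manual handling is needed.
encoded = urlencode({'authenticity_token': token})
print(encoded)
```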
