Python's requests triggers Cloudflare's security while urllib does not

Tags: python, python-3.x, web-scraping, python-requests

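I'm developing an automated web scraper for a restaurant website, but I've run into a problem. The site uses Cloudflare's anti-bot security, which I would like to bypass: not the Under Attack mode, but the captcha test that is only triggered when a non-US IP or a bot is detected. I'm trying to get around it because Cloudflare's security doesn't trigger when I clear cookies, disable JavaScript, or use a US proxy.

Knowing this, I tried using Python's requests library: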

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)
But this ends up triggering Cloudflare, no matter which proxy I use.

However, when using urllib.request with the same headers:

import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))
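When run from the same US IP, this time Cloudflare's security is not triggered, even though it uses the same headers and IP as the requests version.

So I'm trying to figure out exactly what triggers Cloudflare with the requests library but not with urllib.

While the typical answer would be "just use urllib then", I'd like to find out exactly what requests does differently and how to fix it: first to understand how requests works and how Cloudflare detects bots, but also so that I can apply any fix I find to other HTTP libraries (especially asynchronous ones).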

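EDIT N°2: Progress so far:

Thanks to @TuanGeek, we can now bypass the Cloudflare block using requests, as long as we connect directly to the host IP rather than the domain name (for some reason, the DNS redirection with requests triggers Cloudflare, but urllib doesn't):

Note: trying to access via http (rather than https with the verify variable set to False) will trigger Cloudflare's block.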
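As a minimal sketch of the connect-by-IP idea, the helper below resolves a hostname with the standard library and builds the IP-based URL plus the Host header to send with it. The name ip_url_and_headers is hypothetical, and the demo call uses 127.0.0.1 so that it runs without DNS; swap in the real hostname as needed. Unlike a fixed (address, port) unpacking, it also copes with IPv6 results, whose socket address tuple has four elements.

```python
import socket


def ip_url_and_headers(hostname, path, port=443):
    """Resolve hostname and return (url_using_ip, headers_with_host).

    Hypothetical helper for illustration only.
    """
    family, _type, _proto, _canonname, sockaddr = socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM
    )[0]
    # sockaddr is (addr, port) for IPv4 and (addr, port, flowinfo, scope_id) for IPv6
    address = sockaddr[0]
    if family == socket.AF_INET6:
        address = f"[{address}]"  # IPv6 literals need brackets inside a URL
    url = f"https://{address}:{port}{path}"
    # The Host header tells the server which site is meant, since the URL only has the IP
    headers = {"Host": hostname}
    return url, headers


# Demo with a literal IP so no DNS lookup is needed
url, headers = ip_url_and_headers("127.0.0.1", "/guest/accountlogin")
print(url, headers)
```

Because the certificate is issued for the domain name and not for the IP, a client connecting this way has to be told not to verify it (hence verify=False), which removes the protection TLS verification normally provides.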

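Now this is great, but unfortunately my final goal is to make this work asynchronously with the HTTP library HTTPX, and that still doesn't work: with the following code the Cloudflare block is still triggered, even though we connect directly via the host IP, with the right headers, and with verify set to False: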

import trio
import httpx
import socket
from collections import OrderedDict
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
async def asks_worker():
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)
async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)
trio.run(run_task)
EDIT N°1: For more details, here are the raw HTTP requests from urllib and requests

REQUESTS:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR
URLLIB:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js  https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
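The send:/reply:/header: lines above are in the format that Python's http.client prints when its debug level is raised. A minimal sketch of how to switch that on is shown below, assuming the standard library only; make_debug_opener and enable_http_client_debug are hypothetical helper names. The class-level switch is expected to affect requests as well, because urllib3's connection classes subclass http.client.HTTPConnection.

```python
import http.client
import urllib.request


def make_debug_opener():
    """Return a urllib opener whose HTTP(S) connections print send/reply/header lines."""
    return urllib.request.build_opener(
        urllib.request.HTTPHandler(debuglevel=1),
        urllib.request.HTTPSHandler(debuglevel=1),
    )


def enable_http_client_debug():
    """Turn on debug output for every http.client connection created afterwards."""
    # Class attribute: applies to subclasses that do not override it
    http.client.HTTPConnection.debuglevel = 1


opener = make_debug_opener()
enable_http_client_debug()
# Usage (performs a real network request, so it is left commented out):
# opener.open(urllib.request.Request(url, headers=headers)).read()
```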

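This really piqued my interest. Here is the requests solution that I was able to get working.

Solution

Finally narrowed down the problem. When you use requests, it uses the urllib3 connection pool. There seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution: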

import requests
from collections import OrderedDict
from requests import Session
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
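Technical Background

So I ran both methods through Burp Suite to compare the requests. Below are the raw dumps of the requests.

Using requests

Using urllib

The difference is in the ordering of the headers. The difference in the casing of dnt is not actually the problem.

So I was able to make a successful request with the following raw request: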

GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
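One way to confirm the order in which a client actually writes headers on the wire, without Burp Suite, is to point it at a throwaway local socket and read back the raw bytes. The sketch below uses only the standard library; capture_raw_request and send_with_explicit_order are hypothetical helper names, and the local server merely stands in for the real site, so it says nothing about how Cloudflare itself reacts.

```python
import http.client
import socket
import threading

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
              "Gecko/20100101 Firefox/77.0")


def capture_raw_request(send_request):
    """Run send_request(port) against a one-shot local server; return the raw bytes it received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    captured = {}

    def serve():
        conn, _ = server.accept()
        with conn:
            data = b""
            # Read until the blank line that ends the request headers
            while b"\r\n\r\n" not in data:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            captured["raw"] = data
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    send_request(port)
    thread.join(timeout=5)
    server.close()
    return captured["raw"]


def send_with_explicit_order(port):
    """Send Host first, then User-Agent, with no automatically added headers."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    # skip_host / skip_accept_encoding stop http.client from adding its own headers
    conn.putrequest("GET", "/guest/accountlogin", skip_host=True, skip_accept_encoding=True)
    conn.putheader("Host", "grimaldis.myguestaccount.com")
    conn.putheader("User-Agent", USER_AGENT)
    conn.endheaders()
    conn.getresponse().read()
    conn.close()


raw = capture_raw_request(send_with_explicit_order)
# Skip the request line, drop empty trailing lines, keep only the header names
header_names = [line.split(b":")[0] for line in raw.split(b"\r\n")[1:] if line]
print(header_names)
```

The same capture function can be handed a callback that uses any other client, which makes it easy to compare header order between libraries.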



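In the successful raw request, the Host header is sent above User-Agent. So if you want to continue to use requests, consider using an OrderedDict to ensure the ordering of the headers.

After some debugging, and thanks to @TuanGeek's answer, we found that the problem with the requests library seems to come from a DNS issue on requests' side when dealing with Cloudflare. A simple fix is to connect directly to the host IP: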

import requests
from collections import OrderedDict
from requests import Session
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
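Now, this fix doesn't work when using the HTTP library HTTPX, but I found where the problem originates.

The problem comes from the h11 library (used by HTTPX to handle HTTP/1.1 requests): while urllib automatically fixes the letter case of headers, h11 takes a different approach and lowercases every header. In theory this shouldn't cause any issues, since servers should handle headers in a case-insensitive manner (and in many cases they do), but the reality is that HTTP is Hard™️ and services such as Cloudflare don't respect RFC2616 and require headers to be properly capitalized.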
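To make the casing point concrete, here is a small sketch that needs no network access. canonical_header_name, find_header_exact and find_header_insensitive are hypothetical helpers written for this illustration; they are not part of h11, HTTPX, or Cloudflare, and the exact-match lookup only imitates a server that compares header names byte for byte instead of case-insensitively.

```python
def canonical_header_name(name):
    """Capitalize each dash-separated part: 'user-agent' -> 'User-Agent'."""
    return "-".join(part.capitalize() for part in name.split("-"))


def find_header_exact(headers, wanted):
    """Case-sensitive lookup, imitating a strict server (an assumption for illustration)."""
    for name, value in headers:
        if name == wanted:
            return value
    return None


def find_header_insensitive(headers, wanted):
    """Case-insensitive lookup, which is how header names are meant to be compared."""
    for name, value in headers:
        if name.lower() == wanted.lower():
            return value
    return None


# Headers as a lowercasing client would emit them
lowercased = [("host", "grimaldis.myguestaccount.com"), ("user-agent", "Mozilla/5.0")]
# The same headers after restoring conventional capitalization
recased = [(canonical_header_name(name), value) for name, value in lowercased]

print(find_header_exact(lowercased, "User-Agent"))        # strict lookup misses the header
print(find_header_insensitive(lowercased, "User-Agent"))  # tolerant lookup finds it
print(find_header_exact(recased, "User-Agent"))           # recased headers satisfy both
```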

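Discussions about capitalization have been going on for a while over at h11:

And have "recently" started to pop up on HTTPX's repo as well:

Now, the unsatisfying answer to the problem between Cloudflare and HTTPX is that until something is done on h11's side (or until Cloudflare miraculously starts respecting RFC2616), not much can be changed about how HTTPX and Cloudflare handle header capitalization.

Either use a different HTTP library such as aiohttp or requests-futures, try forking h11 and patching the header capitalization yourself, or wait and hope that the h11 team handles the issue properly.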

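I know requests uses urllib3 under the hood to perform the connection. It may be worth exploring how that connection differs between the two libraries (urllib vs urllib3).

I tried looking myself, but it's beyond what I'm familiar with. I'm guessing it has to do with how requests sets up the request. It uses urllib under the hood but takes care of most of the dirty work behind the scenes (which explains why I have to decompress and decode the response with urllib, while requests does it automatically). Maybe a specific encoding or setti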
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1

