Python: web scraping without getting blocked

python selenium

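I have read a lot of posts on this topic and tried some of the suggestions from this article, but I am still getting blocked.

1. IP rotation: I use a VPN and change IP often (but obviously not while the script is running)

2. Set a real user agent: implemented a fake user agent, with no luck

3. Set other request headers: tried SeleniumWire, but how can I use it at the same time as 2?

4. Set random intervals between your requests: done, but at the moment I cannot even reach the starting home page anyway

5. Set a referer: same as 3

6. Use a headless browser: no clue

7. Avoid honeypot traps: same as 4

10. Irrelevant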
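
For the random-interval tip, a pause between page loads could look like the minimal sketch below. The helper name `polite_sleep`, its default bounds, and the injectable `sleep` parameter are illustrative assumptions, not something taken from the article.

```python
import random
import time


def polite_sleep(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Pause for a random duration between min_seconds and max_seconds.

    Hypothetical helper: the bounds are arbitrary defaults, and `sleep`
    can be swapped out (for example in tests) to avoid really waiting.
    Returns the delay that was chosen.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` in place of a fixed `sleep(1)` between two `driver.get(...)` calls would make the request timing less regular, although timing alone may not be what triggers the block.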

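The website I want to scrape is:

Without Selenium: it goes smoothly to a page listing some games and their odds, and I can navigate from there

With Selenium: the page shows a "Winamax est actuellement en maintenance" message, with no games and no odds

Try running this piece of code and you will probably get blocked quite quickly: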

import json
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: the chromedriver path is passed through a Service object
driver = webdriver.Chrome(service=Service("chromedriver"))
driver.get("https://www.winamax.fr/paris-sportifs/")   # I'm even blocked here now !!!

PREFIX = '<script type="text/javascript">var PRELOADED_STATE = '
SUFFIX = ";var BETTING_CONFIGURATION = "

# The page embeds its data as JSON in an inline script: find that line and parse it
titi = {}
for tut in driver.page_source.splitlines():
    if tut.startswith(PREFIX):
        titi = json.loads(tut[len(PREFIX):tut.find(SUFFIX)])

# sport id -> category ids, and category id -> tournament ids
resultat_1 = {sport_id: sport["categories"] for sport_id, sport in titi.get("sports", {}).items()}
resultat_2 = {cat_id: cat["tournaments"] for cat_id, cat in titi.get("categories", {}).items()}

# Build one URL per sport/category/tournament combination
matchez = []
for sport_id, category_ids in resultat_1.items():
    for tgtg in category_ids:
        for p_id3 in resultat_2.get(str(tgtg), []):
            matchez.append("https://www.winamax.fr/paris-sportifs/sports/"+str(sport_id)+"/"+str(tgtg)+"/"+str(p_id3))

# Visit each competition page and collect the links to its matches
matchez_detail = []
for taratata, alisson in enumerate(matchez, start=1):
    print("compet " + str(taratata) + "/" + str(len(matchez)) + " : " + alisson)
    driver.get(alisson)
    sleep(1)
    elements = driver.find_elements(By.XPATH, "//*[@id='app-inner']/div/div[1]/span/div/div[2]/div/section/div/div/div[1]/div/div/div/div/a")
    for elm in elements:
        matchez_detail.append(elm.get_attribute("href"))

# Visit each match page and click every button to expand it
for comptine, mat in enumerate(matchez_detail, start=1):
    print("match " + str(comptine) + "/" + str(len(matchez_detail)) + " : " + mat)
    driver.get(mat)
    sleep(1)
    elements = driver.find_elements(By.XPATH, "//*[@id='app-inner']//button/div/span")
    for elm in elements:
        elm.click()
        sleep(1) # and after my specific code to scrape what I want

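I would suggest using requests. I don't see a reason to use Selenium, since you said requests works, and requests can work with pretty much any site as long as you send the appropriate headers. You can see which headers are needed by looking at the request headers in the developer console in Chrome or Firefox.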
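
As a rough illustration of that approach, the sketch below sends browser-like headers and reads the `PRELOADED_STATE` JSON that the question's code looks for. To stay runnable without extra installs it uses the standard library's `urllib.request` rather than requests. The header values and the function names (`build_request`, `extract_preloaded_state`, `fetch_preloaded_state`) are assumptions for illustration; the real header values should be copied from the browser's developer tools, and there is no guarantee the site accepts them.

```python
import json
import urllib.request

# Assumed example values: replace them with the headers your own browser sends.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "fr-FR,fr;q=0.9,en;q=0.8",
    "Referer": "https://www.winamax.fr/",
}

# Markers taken from the question's code.
PREFIX = '<script type="text/javascript">var PRELOADED_STATE = '
SUFFIX = ";var BETTING_CONFIGURATION = "


def build_request(url, headers=BROWSER_HEADERS):
    """Return a GET request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=dict(headers))


def extract_preloaded_state(html):
    """Parse the JSON assigned to PRELOADED_STATE, or return {} if it is absent."""
    for line in html.splitlines():
        if line.startswith(PREFIX):
            end = line.find(SUFFIX, len(PREFIX))
            if end != -1:
                return json.loads(line[len(PREFIX):end])
    return {}


def fetch_preloaded_state(url):
    """Download the page and return its embedded state (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_preloaded_state(html)
```

With requests installed, the same dictionary can be passed as `requests.get(url, headers=BROWSER_HEADERS)`, and `extract_preloaded_state(response.text)` would then parse the result.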