Proxy pool system for Python Scrapy to temporarily stop using slow/timed-out proxies


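I have been looking around for a suitable proxy pooling system for Scrapy, but I can't find anything that does what I need/want.

I am looking for a solution that does the following: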

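Rotate proxies
  • I want it to switch between proxies at random, but never pick the same proxy twice in a row. (Scrapoxy has this)
Impersonate known browsers
  • Impersonate Chrome, Firefox, Internet Explorer, Edge, Safari, etc. (Scrapoxy has this)
Blacklist slow proxies
  • If a proxy times out or is slow, it should be blacklisted according to a set of rules... (Scrapoxy only blacklists based on the number of instances/startups)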

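  • If a proxy is slow (takes longer than x time), it should be marked as Slow, a timestamp should be taken, and a counter should be increased.
  • If a proxy times out, it should be marked as Fail, a timestamp should be taken, and a counter should be increased.
  • If a proxy has no slows for 15 minutes after its last slow, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy has no fails for 30 minutes after its last fail, the counter and timestamp should be zeroed and the proxy returns to a fresh state.
  • If a proxy is slow 5 times in 1 hour, it should be removed from the pool for 1 hour.
  • If a proxy times out 5 times in 1 hour, it should be blacklisted for 1 hour.
  • If a proxy gets blocked twice in 3 hours, it should be blacklisted for 12 hours and marked as bad.
  • If a proxy gets marked as bad twice in 48 hours, it should notify me (email, Pushbullet... anything).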
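
The rules above can be sketched as a small per-proxy state object. This is only a rough sketch under my own assumptions, not an existing library: `ProxyState`, its method names, and the `notify` callback are hypothetical, and the rules give no quiet-period reset for blocks, so the 3-hour and 48-hour windows double as the reset interval there.

```python
import time
from collections import deque

MINUTE = 60
HOUR = 60 * MINUTE


class ProxyState:
    """Per-proxy bookkeeping for the slow/fail/blocked rules (hypothetical helper)."""

    def __init__(self, notify=print):
        self.slows = deque()     # timestamps of recent slow responses
        self.fails = deque()     # timestamps of recent timeouts
        self.blocks = deque()    # timestamps of recent blocks
        self.bads = deque()      # timestamps at which the proxy was marked bad
        self.banned_until = 0.0  # the proxy stays out of the pool until this time
        self.notify = notify     # callback for the "bad twice in 48 hours" rule

    @staticmethod
    def _record(events, now, quiet_reset, window):
        # Zero the counter if the last event is older than the quiet period.
        if events and now - events[-1] > quiet_reset:
            events.clear()
        events.append(now)
        # Keep only the events that fall inside the counting window.
        while now - events[0] > window:
            events.popleft()
        return len(events)

    def _ban(self, now, duration):
        self.banned_until = max(self.banned_until, now + duration)

    def record_slow(self, now=None):
        now = time.time() if now is None else now
        # 5 slows in 1 hour -> out of the pool for 1 hour
        if self._record(self.slows, now, 15 * MINUTE, HOUR) >= 5:
            self.slows.clear()
            self._ban(now, HOUR)

    def record_fail(self, now=None):
        now = time.time() if now is None else now
        # 5 timeouts in 1 hour -> blacklisted for 1 hour
        if self._record(self.fails, now, 30 * MINUTE, HOUR) >= 5:
            self.fails.clear()
            self._ban(now, HOUR)

    def record_block(self, now=None):
        now = time.time() if now is None else now
        # 2 blocks in 3 hours -> blacklisted for 12 hours and marked bad
        if self._record(self.blocks, now, 3 * HOUR, 3 * HOUR) >= 2:
            self.blocks.clear()
            self._ban(now, 12 * HOUR)
            # marked bad twice in 48 hours -> notify
            if self._record(self.bads, now, 48 * HOUR, 48 * HOUR) >= 2:
                self.notify("proxy marked bad twice within 48 hours")

    def is_available(self, now=None):
        now = time.time() if now is None else now
        return now >= self.banned_until
```

Every method takes an optional `now` so the rules can be exercised without waiting for real time to pass; a pool would keep one `ProxyState` per proxy and skip any proxy whose `is_available()` returns False.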

Does anyone know of any such solution (the main feature being the blacklisting of slow/timed-out proxies…)?

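Maybe this one?

@TarunLalwani That is close but not good enough, because it blocks a proxy on a single timeout. What I described tries to limit when proxies get blacklisted: sometimes a proxy or site is just slow, so it should take x occurrences before a proxy is blacklisted.

I am not sure any project will meet 100% of your requirements. You need to find the closest match and then customize it to your needs.

Have a look at this? It may give you some initial help, or you could use it as a guide for writing your own proxy-rotating Scrapy middleware.

I've implemented something similar before. The basic idea is to scrape a list of proxies, filter it (check whether each proxy works, times out, etc.), and then rotate among them at random. The list needs to be re-scraped and re-checked periodically, and proxies occasionally get blacklisted by the target site, so you need a mechanism to retire/remove them from your candidate list.

Since your rotation rules are very specific, you could code this yourself. See the code below, which implements some of your rules (you will have to implement the rest):
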
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
from random import shuffle

import pexpect


# Test a single proxy by opening a telnet connection to it.
def test_proxy(ip, port, max_timeout=1):
    time_send_request = time.time()
    child = pexpect.spawn("telnet " + ip + " " + str(port))
    try:
        # max_timeout is in seconds
        i = child.expect(["Connected to", "Connection refused"], timeout=max_timeout)
        time_to_answer = time.time() - time_send_request
    except (pexpect.TIMEOUT, pexpect.EOF):
        # no answer in time, or telnet exited without connecting
        i = -1
    finally:
        # always release the telnet child process
        child.close()
    if i == 0:
        return {"status": True, "time_to_answer": time_to_answer}
    return {"status": False, "time_to_answer": max_timeout}


# Test every proxy in the list, update its status, and apply your custom rules.
def update_proxy_list_status(proxy_list):
    for i, proxy in enumerate(proxy_list):
        print("testing proxy " + str(i) + " " + proxy["ip"] + ":" + str(proxy["port"]))
        proxy_status = test_proxy(proxy["ip"], proxy["port"])
        proxy["status_ok"] = proxy_status["status"]

        print(proxy_status)

        # this is where you apply your own rules to update the respective proxy dict

        #~ If a proxy is slow (takes over x time) it should be marked as Slow, a timestamp should be taken and a counter should be increased.
        #~ If a proxy times out it should be marked as Fail, a timestamp should be taken and a counter should be increased.
        #~ If a proxy has no slows for 15 minutes after its last slow, the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy has no fails for 30 minutes after its last fail, the counter & timestamp should be zeroed and the proxy returns to a fresh state.
        #~ If a proxy is slow 5 times in 1 hour it should be removed from the pool for 1 hour.
        #~ If a proxy times out 5 times in 1 hour it should be blacklisted for 1 hour.
        #~ If a proxy gets blocked twice in 3 hours it should be blacklisted for 12 hours and marked as bad.
        #~ If a proxy gets marked as bad twice in 48 hours it should notify me (email, Pushbullet... anything).

        if proxy_status["status"]:
            # the proxy answered: modify the proxy dict with your own rules (add a timestamp, last check time, last down, last up, etc.)
            # ...
            pass
        else:
            # the proxy failed: modify the proxy dict with your own rules (add a timestamp, last check time, last down, last up, etc.)
            # ...
            pass

    return proxy_list


# Select a good proxy and do the job.
def main():

    # first populate a proxy list | these example proxies come from http://free-proxy.cz/en/
    proxy_list=[
        {"ip":"167.99.2.12","port":8080}, #bad proxy
        {"ip":"167.99.2.17","port":8080},
        {"ip":"66.70.160.171","port":1080},
        {"ip":"192.99.220.151","port":8080},
        {"ip":"142.44.137.222","port":80}
        # [...]
    ]



    # keeps track of the last proxy used (to avoid using the same one twice in a row)
    previous_proxy_ip=""

    the_job=True
    while the_job:

        # update the status of every proxy
        proxy_list = update_proxy_list_status(proxy_list)

        # keep only the proxies considered OK
        good_proxy_list = [d for d in proxy_list if d["status_ok"]]

        # shuffle the list so the selection is random
        shuffle(good_proxy_list)

        # select a proxy that differs from the previously used one
        current_proxy = {}
        for candidate in good_proxy_list:
            if candidate["ip"] != previous_proxy_ip:
                previous_proxy_ip = candidate["ip"]
                current_proxy = candidate
                break

        # no usable proxy in this round: wait, then test the list again
        if not current_proxy:
            time.sleep(5)
            continue

        # use the selected proxy to do the job
        print("the current proxy is: " + str(current_proxy))

        # UPDATE SCRAPY PROXY

        # DO THE SCRAPY JOB
        print("DO MY SCRAPY JOB with the current proxy settings")

        # wait a few seconds
        time.sleep(5)

if __name__ == "__main__":
    main()
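
For the "UPDATE SCRAPY PROXY" step, one possible approach is a Scrapy downloader middleware that sets `request.meta["proxy"]` on each request. The sketch below is a rough illustration, not a tested production component: the class name, the `PROXY_LIST` setting, and the `slow_threshold` default are my own assumptions. It relies on Scrapy storing the fetch time under the `download_latency` meta key, and it counts every download exception as a failure, which you may want to narrow to timeout errors.

```python
import random


class RotatingProxyMiddleware:
    """Sketch of a downloader middleware that rotates proxies (hypothetical name)."""

    def __init__(self, proxies, slow_threshold=5.0):
        self.proxies = list(proxies)          # assumed non-empty, e.g. "http://167.99.2.17:8080"
        self.slow_threshold = slow_threshold  # seconds; an arbitrary default
        self.previous = None                  # proxy handed out for the last request
        self.slow_counts = {p: 0 for p in self.proxies}
        self.fail_counts = {p: 0 for p in self.proxies}

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding the proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def pick_proxy(self):
        # Pick at random, excluding the proxy used for the previous request.
        # With a single proxy in the list there is nothing else to choose from.
        candidates = [p for p in self.proxies if p != self.previous] or self.proxies
        self.previous = random.choice(candidates)
        return self.previous

    def process_request(self, request, spider):
        request.meta["proxy"] = self.pick_proxy()
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        latency = request.meta.get("download_latency")
        proxy = request.meta.get("proxy")
        if latency is not None and latency > self.slow_threshold and proxy in self.slow_counts:
            self.slow_counts[proxy] += 1  # hook your "slow" rule in here
        return response

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get("proxy")
        if proxy in self.fail_counts:
            self.fail_counts[proxy] += 1  # hook your "fail" rule in here
        return None  # let other middlewares handle the exception
```

To try it, you would register the class in `DOWNLOADER_MIDDLEWARES`, probably with an order value below 750 so that it runs before the built-in `HttpProxyMiddleware`. Because Scrapy issues requests concurrently, "not the same proxy twice in a row" here means the order in which proxies are handed out, not the order in which responses come back.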