Python parsing with multithreading and reused proxies

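Description: I am trying to parse a large amount of data, but when two threads work through the same IP address at the same time, the server returns an error. I do not have enough proxies to give every task its own.

Question: How can I make the threads cycle through the proxy list, check whether a proxy is busy, and hand the work to a free one?

What I want: I would like to give concurrent.futures.ThreadPoolExecutor a list of proxies so that it cycles through the list and checks whether each proxy is busy.

What I tried: At the moment I repeat the proxy list until it covers the whole range, list = list * (range // len(list)). I have also tried picking a proxy with random choice.

My code (the indentation was mangled when pasting):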

def start_threads():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(get_items_th, range(500), proxy_list)

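Additional question: should the maximum number of threads equal the number of processor threads? I want to finish this task as quickly as possible.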

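You can create a Proxy class that defines the __enter__ and __exit__ methods, so that an instance works as a context manager, and then use it in a with statement. A proxy is marked busy when a thread takes it and is marked free again when the with block exits.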

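import threading

# Placeholder values; replace them with real proxy addresses.
PROXIES = {
    "PROXY1": "1",
    "PROXY2": "2",
    "PROXY3": "3",
    "PROXY4": "4",
    "PROXY5": "5",
}

class Proxy:
    _Proxies = []                     # every Proxy instance, shared by all threads
    cls_lock = threading.Condition()  # guards the free flags and lets threads wait

    def __init__(self, name, proxy):
        self.free = True
        self.name = name
        self.proxy = proxy
        self.__class__._Proxies.append(self)

    @classmethod
    def Get_Free_Proxy(cls):
        # Block until some proxy is free, then mark it busy and return it.
        with cls.cls_lock:
            while True:
                for proxy in cls._Proxies:
                    if proxy.free:
                        proxy.free = False
                        return proxy
                # Nothing is free: release the lock and sleep until __exit__ notifies.
                cls.cls_lock.wait()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Mark the proxy free again and wake one waiting thread.
        with self.cls_lock:
            self.free = True
            self.cls_lock.notify()

for key, value in PROXIES.items():
    Proxy(key, value)

# Inside the with block the first proxy is reserved.
with Proxy.Get_Free_Proxy() as locked_proxy:
    for p in Proxy._Proxies:   # p, not Proxy, so the class name is not shadowed
        print(p.name, p.free)

print()

# After the with block every proxy is free again.
for p in Proxy._Proxies:
    print(p.name, p.free)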

This prints:

PROXY1 False
PROXY2 True
PROXY3 True
PROXY4 True
PROXY5 True

PROXY1 True
PROXY2 True
PROXY3 True
PROXY4 True
PROXY5 True
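
As a possible alternative to the class above, the same "take a free proxy, give it back when done" idea can be sketched with queue.Queue from the standard library, which blocks a thread while the pool is empty. The proxy addresses, the borrow_proxy helper, and fake_request are hypothetical names used only for illustration, and the download is simulated with a short sleep rather than a real request.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

# Hypothetical placeholder addresses; substitute real proxy URLs.
PROXY_URLS = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]

# The queue is the pool: a proxy sits in it only while nobody is using it.
free_proxies = queue.Queue()
for url in PROXY_URLS:
    free_proxies.put(url)


@contextmanager
def borrow_proxy():
    """Block until a proxy is free, yield it, and always put it back."""
    proxy = free_proxies.get()  # blocks while the pool is empty
    try:
        yield proxy
    finally:
        free_proxies.put(proxy)


in_use = set()                 # proxies currently held by some thread
in_use_lock = threading.Lock()
overlaps = []                  # filled only if two threads ever share a proxy


def fake_request(page):
    """Stand-in for the real download; it only sleeps briefly."""
    with borrow_proxy() as proxy:
        with in_use_lock:
            if proxy in in_use:
                overlaps.append((page, proxy))
            in_use.add(proxy)
        time.sleep(0.01)       # simulate network latency
        with in_use_lock:
            in_use.discard(proxy)
        return page, proxy


with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fake_request, range(20)))

print(len(results), "tasks finished,", len(overlaps), "overlaps")
```

With eight workers and three proxies, at most three requests run at once; the remaining workers wait inside free_proxies.get() until a proxy is returned.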

You can then modify your code like this:

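import concurrent.futures
import requests

def start_threads():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(get_items_th, range(500))

def get_items_th(begin):
    # The proxy stays reserved for this thread until the with block exits.
    with Proxy.Get_Free_Proxy() as locked_proxy:
        items = []
        headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.3'}
        # requests takes a scheme-to-URL dict; this assumes locked_proxy.proxy holds the proxy URL.
        proxies = {'http': locked_proxy.proxy, 'https': locked_proxy.proxy}
        url = 'https://*.com/?query=&start=' + str(begin * 100) + '&count=100&search_descriptions=0&sort_column=popular&sort_dir=desc&norender=1'
        # cookie is defined elsewhere in the asker's code. timeout is in seconds,
        # so the original value of 15000 would have meant more than four hours.
        r = requests.get(url, headers=headers, timeout=15, cookies=cookie, proxies=proxies)
        .
        .
        .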
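
Two details of start_threads may be worth a closer look. First, executor.map returns a lazy iterator, and an exception raised in a worker only surfaces when its result is retrieved, so discarding the return value hides failures. Second, on the additional question: threads that wait on the network are mostly idle, so the worker count does not need to match the number of processor threads, but when every thread needs its own proxy, workers beyond the number of proxies only wait. The sketch below illustrates both points under those assumptions; fetch_page, run_all, and NUM_PROXIES are hypothetical stand-ins, and no real request is made.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PROXIES = 5  # assumed size of the proxy pool


def fetch_page(begin):
    """Hypothetical stand-in for get_items_th; fails on one page on purpose."""
    if begin == 3:
        raise ValueError(f"page {begin} failed")
    return begin * 100


def run_all(pages):
    # With one proxy per thread, workers beyond NUM_PROXIES would only wait.
    workers = min(8, NUM_PROXIES)
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fetch_page, page) for page in pages]
        for page, future in zip(pages, futures):
            try:
                results.append(future.result())  # re-raises the worker's exception
            except ValueError as exc:
                errors.append((page, str(exc)))
    return results, errors


results, errors = run_all(range(6))
print(results)  # [0, 100, 200, 400, 500]
print(errors)   # [(3, 'page 3 failed')]
```

Collecting the failed pages this way makes it possible to retry them later, possibly through a different proxy.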
