How to get a faster speed when using multithreading in Python

python multithreading post tcp

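I am currently studying how to fetch data from a website as quickly as possible. To get more speed, I am considering multithreading. Here is the code I used to test the difference between a multithreaded POST and a simple POST: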

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()



        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
           thread = threading.Thread(target = current_post.post)
           thread.start()
           thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

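As you can see, this is very simple code. First I set the mode to "Simple" and got a time interval of 50 s (maybe my connection is a bit slow :( ). Then I set the mode to "Multiple" and got a time interval of 35 s. From this I can see that multithreading really can increase the speed, but the result is not as good as I imagined. I want it to be much faster.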

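From debugging, I found that the program mainly blocks at the line open_url = urllib2.urlopen(req, self.data). This line takes a long time to post the data and receive the response from the specified website. I thought I might gain speed by adding time.sleep() and using multithreading inside the urlopen function, but I cannot do that because it is Python's own function.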

Setting aside any limit the server may place on the POST rate, what else can I do to get more speed? Is there any other code I could modify? Thanks a lot!

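DNS lookups take time, and there is nothing you can do about that. Large latencies are one reason to use multiple threads in the first place: multiple lookups and site GET/POST requests can then happen at the same time.

Dump the sleep(); it does not help.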

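Keep in mind that the only case where multithreading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multithreading does not increase "speed", because the interpreter's global lock (the GIL) lets only one thread run Python code at a time, so the threads cannot use more than one CPU (no, not even if you have multiple cores; Python does not work that way). You should use multithreading when you want two things to make progress at the same time, not when you want two things to run in parallel (that is, as two separately running processes).

What you are doing will not make any single DNS lookup faster, but it does allow several requests to be sent while you wait for the results of others. Be careful about how many you send at once, though, or you will only make the response times worse than they already are.

Also, please stop using urllib2 and use Requests instead.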
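
The advice above about limiting how many requests are in flight can be sketched with a bounded thread pool. This is a minimal Python 3 sketch, not the answerer's code: `fake_post` is a hypothetical stand-in that sleeps instead of contacting a server, and `MAX_WORKERS`, `SIMULATED_LATENCY` and the URLs are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4          # assumption: cap on simultaneous requests
SIMULATED_LATENCY = 0.1  # assumption: seconds each fake request "waits"


def fake_post(url):
    """Hypothetical stand-in for a blocking HTTP POST."""
    time.sleep(SIMULATED_LATENCY)  # blocks like network I/O would
    return "OK " + url


def post_all(urls, max_workers=MAX_WORKERS):
    """Run fake_post over urls with at most max_workers requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input urls
        return list(pool.map(fake_post, urls))


if __name__ == "__main__":
    urls = ["http://example.invalid/%d" % i for i in range(8)]
    start = time.time()
    results = post_all(urls)
    print(len(results), "responses in %.2f s" % (time.time() - start))
```

Raising `max_workers` shortens the total wait only up to a point; past that, the extra simultaneous requests mostly add load on the server and the network.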

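The biggest thing you are doing wrong, and the one that hurts your throughput the most, is the way you call thread.start() and thread.join():
Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. Nothing ever runs concurrently.
What you should probably be doing is: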
threads = []

# start all of the threads
for i in range(0, 10):
   thread = threading.Thread(target = current_post.post)
   thread.start()
   threads.append(thread)

# now wait for them all to finish
for thread in threads:
   thread.join()
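
To see why the placement of join() matters, here is a small Python 3 sketch that times both loop shapes. It is only an illustration: `slow_task` is a hypothetical placeholder that sleeps in place of the real `post()` call, and `DELAY` is an arbitrary assumption.

```python
import threading
import time

DELAY = 0.1  # assumption: stand-in for one request's network latency


def slow_task():
    """Hypothetical placeholder for a blocking network call."""
    time.sleep(DELAY)


def join_inside_loop(n):
    """Start a thread and wait for it immediately: effectively serial."""
    start = time.time()
    for _ in range(n):
        t = threading.Thread(target=slow_task)
        t.start()
        t.join()
    return time.time() - start


def join_after_loop(n):
    """Start every thread first, then wait for all of them."""
    start = time.time()
    threads = [threading.Thread(target=slow_task) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


if __name__ == "__main__":
    print("join inside loop: %.2f s" % join_inside_loop(5))
    print("join after loop:  %.2f s" % join_after_loop(5))
```

With five tasks, the first function needs roughly the sum of the delays, while the second needs roughly one delay.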

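In many cases, Python's threading does not improve execution speed very well, and sometimes it makes it worse. For more information, see David Beazley's presentation on the GIL. The presentation is very informative, and I highly recommend it to anyone considering threading.
Even though David Beazley's talk explains that network traffic improves the scheduling of Python's threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).
I ran this on one of my older machines (Python 2.6.6); the timings are listed with the code at the bottom.
I agree with TokenMacGuy's comment, and those numbers include moving the .join() to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading.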

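Threads are a bad idea in Python: they bottleneck easily and can get trapped by the GIL. Try multiprocessing.

@JakobBowyer: Threads are an implementation detail here; the real point is having multiple connections open. The GIL aspect of threading in Python plays no role here.

@nightcracker, you really should read up on the GIL and threading before making statements like that... start here:

I didn't even look that far down. Join after start :(

This is an incremental improvement, but regardless, Python's existing threading is bad. We should be recommending multiprocessing; please see my answer.

@Mike: It is not an incremental improvement at all; using the code supplied by MarkZar, it improved the runtime in my tests from about 20 seconds to less than half a second. That makes sense, since HTTP uses minimal CPU but is highly sensitive to network latency, so using threading rather than multiprocessing is a perfectly reasonable solution. This goes double if you use an HTTP client with keep-alive (in my fixed threading test, urllib3 was about 30% faster than urllib2 with no other changes), which cannot be shared across processes.

@TokenMacGuy, HTTP in Python uses a lot of CPU while parsing the queries. That is really not the point, though, as David Beazley's presentation makes very clear. There is no good scheduling solution between threads in Python... as you can see, multiprocessing is much faster than Python threads.

@user1121352, right... I used data to justify multiprocessing over threading... I did not rely only on his presentation.

Thanks, but I just cannot figure out why time.sleep() is useless. In fact, it also works well after dumping sleep(), but how does it achieve multithreading without sleep()? Does Python automatically run the different threads at random intervals? If so, what is the use of the sleep() function?

It is not useless, just inappropriate here. There are loads of uses for sleep: "After turning on the pump, wait at least ten seconds for the pressure to stabilize before opening the feed valve."

Thanks a lot. Multiprocessing is a good idea; it is indeed a little faster than multithreading on my computer.

Thanks, all of you. I learned a lot from this question.

@MarkZar, I would say that a 33% speed improvement is more than a little faster, but either way, I hope you manage it.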

current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution) --> 1.650 seconds

from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
               thread = threading.Thread(target = self.post)
               thread.start()
               threads.append(thread)
            for thread in threads:
               thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
               process = Process(target=self.post)
               process.start()
               processes.append(process)
            for process in processes:
               process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )
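
The listing above is Python 2 (`urllib2`, `print` statements). A rough Python 3 adaptation might look like the sketch below. It is an assumption-laden rewrite rather than the answerer's code: `http_post`, `run_job` and the `worker` parameter are names introduced here, and the worker is passed in so the timing harness can be tried without touching the network.

```python
import threading
import time
import urllib.request
from multiprocessing import Process


def http_post(url, data):
    """POST data (a str) to url and return the response body as bytes."""
    with urllib.request.urlopen(url, data.encode("ascii")) as response:
        return response.read()


def run_job(mode, worker, args=(), count=10):
    """Call worker(*args) count times in the given mode; return elapsed seconds."""
    start = time.time()
    if mode == "Simple":
        # serial execution: one call after another
        for _ in range(count):
            worker(*args)
    elif mode in ("Multiple", "Process"):
        # Thread and Process share the same start()/join() interface
        factory = threading.Thread if mode == "Multiple" else Process
        jobs = [factory(target=worker, args=args) for _ in range(count)]
        for job in jobs:
            job.start()
        for job in jobs:
            job.join()
    else:
        raise ValueError("unknown mode: %r" % (mode,))
    return time.time() - start


if __name__ == "__main__":
    # URL and form data copied from the question; they may no longer be valid.
    url = "http://forum.xda-developers.com/login.php"
    data = "vb_login_username=test&vb_login_password&securitytoken=guest&do=login"
    for mode in ["Process", "Multiple", "Simple"]:
        print("mode - {0}: {1}".format(mode, run_job(mode, http_post, (url, data))))
```

For a dry run without a server, something like `run_job("Multiple", time.sleep, (0.05,), count=4)` exercises the same code path with a sleeping worker.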