C++ Boost thread overhead

c++ multithreading performance

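In the simple program below, I measure a timing overhead of three orders of magnitude when the call goes through a Boost thread. Is there any way to reduce this overhead and speed up the threadFoo() call?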

#include <iostream>
#include <stdint.h>   // uint64_t
#include <time.h>
#include <boost/thread.hpp>
#include <boost/date_time.hpp>

typedef uint64_t tick_t;

// Read the x86 time-stamp counter (GCC/Clang inline assembly).
#define rdtscll(val) do { \
    unsigned int __a,__d; \
    __asm__ __volatile__("rdtsc" : "=a" (__a), "=d" (__d)); \
    (val) = ((unsigned long long)__a) | (((unsigned long long)__d)<<32); \
} while(0)

class baseClass {
public:
    void foo() {
        // Do nothing
    }

    // Creates a new thread that runs foo(), then waits for it to finish.
    void threadFoo() {
        threadObjOne = boost::thread(&baseClass::foo, this);
        threadObjOne.join();
    }

private:
    boost::thread threadObjOne;
};

int main() {
    std::cout << "main startup" << std::endl;
    baseClass baseObj;
    tick_t startTime, endTime;

    // Direct call
    rdtscll(startTime);
    baseObj.foo();
    rdtscll(endTime);
    std::cout << "native foo() call takes " << endTime - startTime << " clock cycles" << std::endl;

    // Call through a freshly created thread (create + run + join)
    rdtscll(startTime);
    baseObj.threadFoo();
    rdtscll(endTime);
    std::cout << "Thread foo() call takes " << endTime - startTime << " clock cycles" << std::endl;
}

What you really want is a thread pool:

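#include "threadpool.hpp"

// Placeholder task type (not in the original answer): any callable
// taking no arguments works; put the small unit of work in operator().
struct task
{
    explicit task(int id) : id_(id) {}
    void operator()() const { /* do the work for item id_ */ }
    int id_;
};

int main()
{
    boost::threadpool::pool threadpool(8);  // I have 4 CPUs.
                                            // 8 might be overkill; you need to time it
                                            // to get exact numbers, but it depends on
                                            // blocking and other factors.

    for(int loop = 0; loop < 100000; ++loop)
    {
        // Schedule 100,000 small tasks to be run in the 8 threads in the pool
        threadpool.schedule(task(loop));
    }

    // Destructor of threadpool
    // will force the main thread to wait
    // for all tasks to complete before exiting
}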
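
To make it clearer why a pool avoids the per-call cost, here is a minimal sketch of the mechanism: a fixed set of worker threads is created once, and each scheduled task is only pushed onto a queue. This is not the implementation behind threadpool.hpp; it uses the C++11 standard library instead of Boost, and the names SimplePool and count_with_pool are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool; NOT the boost::threadpool implementation.
class SimplePool {
public:
    // Threads are created once, here, and reused for every task.
    explicit SimplePool(std::size_t num_threads) {
        assert(num_threads > 0);
        for (std::size_t i = 0; i < num_threads; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    // Lets the workers drain the queue, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }

    SimplePool(const SimplePool&) = delete;
    SimplePool& operator=(const SimplePool&) = delete;

    // Scheduling is a queue push plus a notify; no thread is created.
    void schedule(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reachable once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Schedules `tasks` trivial increments and returns the final count.
int count_with_pool(std::size_t threads, int tasks) {
    std::atomic<int> counter(0);
    {
        SimplePool pool(threads);
        for (int i = 0; i < tasks; ++i) {
            pool.schedule([&counter] { counter.fetch_add(1); });
        }
    }  // the destructor waits for all queued tasks
    return counter.load();
}
```

Calling count_with_pool(8, 100000) mirrors the loop above. Each task still pays for a mutex lock and a condition-variable wake-up, so very small tasks remain far more expensive than a direct call.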

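Call the Boost thread first and the native call last, and tell us the result.

It looks like you are hitting a context switch: ~30 million cycles at ~3 GHz is about 10 ms, which is very close to the scheduler's time granularity.

The difference grows to 4 orders of magnitude:
main startup
Thread foo() call takes 13418779 clock cycles
native foo() call takes 2197 clock cycles

We know that starting a thread has a cost. You need to ask the OS for stack space and set up a whole bunch of other things. As Lyth said, 10 ms is not that bad. So the orders of magnitude are irrelevant, because you do not create a thread every time you want to run a function in parallel. The way to speed this up is to create one thread, call foo() a billion times in it, and compare that cost with calling foo() a billion times normally. The cost of setting up the thread then becomes negligible.

I see your point, Loki. However, this supposedly negligible thread overhead will kill the parallel processing in my application, because in my case each parallel function body is very simple. As a result, calling the functions sequentially beats the multithreaded implementation. If there is no other way to reduce this overhead, I think I should forget about multithreading.

Yes, using threads incorrectly will usually slow an application down rather than speed it up. But that does not stop us from trying to get you to use them correctly and gain speed. The key is not to start a new thread for each function. You only want to start a small number of threads anyway (the exact number varies, but (1.5->2)* is a good starting point). Then divide the functions among the threads, so that each thread executes a group of functions sequentially. By using a thread pool (as @KillianDS suggested) you can distribute the load dynamically.

After testing: the thread pool is faster, but still 3 orders of magnitude slower than the native call.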
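
The advice in the comments, to start a few threads and have each one run a group of functions sequentially, can be sketched as follows. This is only an illustration under assumptions: it uses C++11 std::thread rather than boost::thread, tiny_work is a made-up stand-in for the very small function body described by the asker, and run_chunked is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the very small per-item function.
inline long long tiny_work(int x) { return static_cast<long long>(x) * 2; }

// Splits [0, n) into one contiguous chunk per thread. Each thread is created
// once and processes its whole chunk sequentially, so the thread start-up
// cost is paid num_threads times instead of n times.
long long run_chunked(int n, unsigned num_threads) {
    assert(n >= 0 && num_threads > 0);
    const int threads_i = static_cast<int>(num_threads);
    const int chunk = (n + threads_i - 1) / threads_i;  // ceiling division

    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const int begin = std::min(n, static_cast<int>(t) * chunk);
        const int end = std::min(n, begin + chunk);
        threads.emplace_back([begin, end, t, &partial] {
            long long local = 0;  // accumulate locally, write the slot once
            for (int i = begin; i < end; ++i) local += tiny_work(i);
            partial[t] = local;
        });
    }
    for (std::thread& th : threads) th.join();

    // Combine the per-thread results on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

For example, run_chunked(100000, 4) performs 100,000 calls to tiny_work but creates only four threads. Whether this beats a plain sequential loop still depends on how much work each item does, so it needs to be timed on the target machine.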
