C++ hand-written multithreaded loop: poor scalability

c++ multithreading boost-thread interlocked-increment


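I wrote this test application: it iterates from 0 to 9999 and, for each integer in the range, computes a useless but computationally intensive function. The program then prints the sum of the function values. To run it on several threads I use InterlockedIncrement: if the iteration number after the increment is still within the range, the thread processes that iteration; otherwise it exits. The program scales worse than I expected.

The results can be explained by the fact that my CPU supports AMD Turbo CORE technology: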

In Turbo CORE mode, the AMD Phenom™ II X6 1090T shifts its clock frequency from 3.2 GHz on six cores to 3.6 GHz on three cores.

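So the clock frequency differs between single-threaded and multithreaded runs. I was used to working with multithreading on CPUs that do not support TurboCore. Below is an image showing the test results: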

AMD OverDrive utility window (lets you turn TurboCore on or off)
Run with 1 thread, TurboCore on
Run with 1 thread, TurboCore off
Run with 5 threads
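The clock-frequency explanation can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the loop is purely compute-bound (runtime inversely proportional to clock frequency), that the single-threaded run held the 3.6 GHz boost clock the whole time, and that the 5-thread run stayed at 3.2 GHz.

```cpp
#include <cassert>

// Speedup to expect from `threads` cores when the single-threaded baseline
// ran at a boosted clock. Assumption: runtime is inversely proportional to
// clock frequency and the boost clock held for the entire baseline run.
double ExpectedSpeedup( unsigned threads, double baseGHz, double boostGHz )
{
  assert( boostGHz > 0.0 );
  return threads * baseGHz / boostGHz;
}

// Speedup actually measured from two wall-clock times, in seconds.
double MeasuredSpeedup( double singleThreadSec, double multiThreadSec )
{
  assert( multiThreadSec > 0.0 );
  return singleThreadSec / multiThreadSec;
}
```

Under these assumptions, ExpectedSpeedup( 5, 3.2, 3.6 ) is about 4.44, which is close to the measured 36 s / 8 s = 4.5 and 358 s / 80.43 s ≈ 4.45 quoted in the comments.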

Many thanks to everyone who tried to help.

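Have you tried splitting the work up front instead of having the threads access shared state?

Thanks for reading this. By shared state, do you mean the g_globalCounter variable? No, I haven't tried that; I assumed first-come, first-served would give the best load balancing. I did try increasing the number of value = sqrt(value * log(1.0 + value)) iterations tenfold, which reduces contention on the iteration counter. The results were 80.43 s and 358 s, so I don't think shared state is the cause.

Next bad guess: how are sqrt/log implemented? FPU contention, maybe?

@j_random_hacker, I've changed the function interface to "void ThreadProc(shared_ptr<sThreadData> data, unsigned iStart, unsigned iEnd)". No speedup; still the old 36 s vs. 8 s.

Thanks (I take it you also got rid of InterlockedIncrement()). I think it was worth ruling that out as a possible cause. But in that case I'm baffled! AFAIK each CPU has its own FPU (and SSE registers), so I can't see how the FP contention Martin suggests would arise. Do you have other programs running in the background? If you start 5 instances of a single-threaded program at the same time, each doing only g_maxIter/5 iterations, do they all take longer than a single instance started alone?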
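The pre-splitting discussed in the comments above might look roughly like the following. This is a sketch under stated assumptions, not the asker's actual change: it uses std::thread (C++11) instead of Boost so that it is self-contained, and the names Work, RangeProc, SumPresplit and the innerIters parameter are hypothetical. Each thread gets a fixed half-open range and never touches a shared counter.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

// Same per-iteration work as in the question; innerIters is a parameter
// here (an assumption) so the sketch can also be run quickly.
double Work( unsigned iterIndex, unsigned innerIters )
{
  double value = iterIndex * 0.12345777;
  for( unsigned i = 0; i < innerIters; ++i )
    value = std::sqrt( value * std::log( 1.0 + value ) );
  return value;
}

// Sums the fixed range [iStart, iEnd); no shared counter is involved.
void RangeProc( double& out, unsigned iStart, unsigned iEnd, unsigned innerIters )
{
  double sum = 0.0;
  for( unsigned i = iStart; i < iEnd; ++i )
    sum += Work( i, innerIters );
  out = sum;  // written once; the caller reads it only after join()
}

double SumPresplit( unsigned maxIter, unsigned threadCount, unsigned innerIters )
{
  assert( threadCount > 0 );
  std::vector<double> partial( threadCount, 0.0 );
  std::vector<std::thread> threads;
  const unsigned long long n = maxIter;
  for( unsigned t = 0; t < threadCount; ++t )
  {
    // Integer arithmetic spreads any remainder across the ranges.
    unsigned iStart = static_cast<unsigned>( n * t / threadCount );
    unsigned iEnd   = static_cast<unsigned>( n * ( t + 1 ) / threadCount );
    threads.emplace_back( RangeProc, std::ref( partial[t] ), iStart, iEnd, innerIters );
  }

  double sum = 0.0;
  for( unsigned t = 0; t < threadCount; ++t )
  {
    threads[t].join();
    sum += partial[t];
  }
  return sum;
}
```

Calling SumPresplit( 10000, 5, 100000 ) would correspond to the 5-thread run in the question. As reported in the comments, pre-splitting did not change the timings there, which is consistent with the shared counter not being the bottleneck. The sum can differ in the last digits between thread counts because floating-point addition happens in a different order.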
The test program from the question:

#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <vector>
#include <windows.h>
#include <iostream>
#include <cmath>

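// Using-declarations instead of "using namespace std/boost": with a C++11
// standard library, shared_ptr, make_shared and thread exist in both
// namespaces and the unqualified names would be ambiguous.
using std::cout;
using std::vector;
using std::sqrt;
using std::log;
using boost::shared_ptr;
using boost::make_shared;
using boost::thread;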

struct sThreadData
{
  sThreadData() : iterCount(0), value( 0.0 ) {}
  unsigned iterCount;
  double value;
};

volatile LONG g_globalCounter;
const LONG g_maxIter = 10000;

void ThreadProc( shared_ptr<sThreadData> data )
{
  double threadValue = 0.0;
  unsigned threadCount = 0;

  while( true )
  {
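    // InterlockedIncrement returns the incremented value, so subtract 1
    // to cover indices 0..g_maxIter-1 as described in the text.
    LONG iterIndex = InterlockedIncrement( &g_globalCounter ) - 1;
    if( iterIndex >= g_maxIter )
      break;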

    ++threadCount;

    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = threadCount;
}

int main()
{
  // Number of worker threads; set to 5 for the multithreaded run.
  const unsigned threadCount = 1;

  vector< shared_ptr<sThreadData> > threadData;
  for( unsigned i = 0; i < threadCount; ++i )
    threadData.push_back( make_shared<sThreadData>() );

  g_globalCounter = 0;

  DWORD t1 = GetTickCount();
  vector< shared_ptr<thread> > threads;
  for( unsigned i = 0; i < threadCount; ++i )
    threads.push_back( make_shared<thread>( &ThreadProc, threadData[i] ) );

  double sum = 0.0;
  for( unsigned i = 0; i < threadData.size(); ++i )
  {
    threads[i]->join();
    sum += threadData[i]->value;
  }

  DWORD t2 = GetTickCount();
  cout << "T=" << static_cast<double>(t2 - t1) / 1000.0 << "s\n";

  cout << "Sum= " << sum << "\n";
  for( unsigned i = 0; i < threadData.size(); ++i )
    cout << threadData[i]->iterCount << "\n";

  return 0;
}
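For readers without Boost or the Windows API, the same work-claiming loop can be sketched with std::atomic and std::thread (C++11). This is a portable approximation, not the original program: ThreadResult, ClaimProc, RunClaiming and the innerIters parameter are hypothetical names introduced here. Note that fetch_add returns the value before the increment, whereas InterlockedIncrement returns the value after it.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

struct ThreadResult
{
  unsigned iterCount = 0;  // iterations this thread claimed
  double value = 0.0;      // sum of function values for those iterations
};

void ClaimProc( std::atomic<long>& counter, long maxIter, unsigned innerIters,
                ThreadResult& result )
{
  double threadValue = 0.0;
  unsigned threadCount = 0;

  while( true )
  {
    // fetch_add returns the previous value, so indices run 0..maxIter-1.
    long iterIndex = counter.fetch_add( 1 );
    if( iterIndex >= maxIter )
      break;

    ++threadCount;

    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < innerIters; ++i )
      value = std::sqrt( value * std::log( 1.0 + value ) );

    threadValue += value;
  }

  result.value = threadValue;
  result.iterCount = threadCount;
}

std::vector<ThreadResult> RunClaiming( long maxIter, unsigned threadCount,
                                       unsigned innerIters )
{
  assert( threadCount > 0 );
  std::atomic<long> counter( 0 );
  std::vector<ThreadResult> results( threadCount );
  std::vector<std::thread> threads;
  for( unsigned t = 0; t < threadCount; ++t )
    threads.emplace_back( ClaimProc, std::ref( counter ), maxIter, innerIters,
                          std::ref( results[t] ) );

  for( unsigned t = 0; t < threadCount; ++t )
    threads[t].join();  // results are read only after every thread finished
  return results;
}
```

RunClaiming( 10000, 5, 100000 ) would mirror the 5-thread run; wrapping the call in std::chrono::steady_clock timestamps gives a portable replacement for GetTickCount. The per-thread iterCount values always add up to maxIter, but how they are distributed depends on scheduling.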