C++11 thread vs. async performance (VS2013)

c++ multithreading c++11 asynchronous visual-studio-2013

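I feel like I'm missing something.

I changed some code from using std::thread to std::async and noticed a substantial performance increase. I wrote a simple test that I assumed would run nearly identically with std::thread and with std::async: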

std::atomic<int> someCount = 0;
const int THREADS = 200;
std::vector<std::thread> threadVec(THREADS);
std::vector<std::future<void>> futureVec(THREADS);
auto lam = [&]()
{
    for (int i = 0; i < 100; ++i)
        someCount++;
};

for (int i = 0; i < THREADS; ++i)
    threadVec[i] = std::thread(lam);
for (int i = 0; i < THREADS; ++i)
    threadVec[i].join();

for (int i = 0; i < THREADS; ++i)
    futureVec[i] = std::async(std::launch::async, lam);
for (int i = 0; i < THREADS; ++i)
    futureVec[i].get();

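I haven't done an extensive analysis, but some preliminary results suggest that the std::async code runs around 10 times faster! Results varied slightly with optimizations turned off, and I also tried switching the execution order.

Is this a Visual Studio compiler issue? Or is there some deeper implementation detail I am overlooking that would account for this performance difference? I thought std::async was a wrapper around std::thread calls.

Given these differences, what is the way to get the best performance here? (There are more ways to create threads than std::thread and std::async.)

And what if I want detached threads? (As far as I know, std::async cannot do that.)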
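When you use async in Visual Studio's implementation, you are not creating new threads; you are reusing threads available in a thread pool. Creating and destroying a thread is a very expensive operation that costs roughly 200,000 CPU cycles on Windows. On top of that, remember that when the number of threads is much larger than the number of CPU cores, the operating system has to spend more time creating the threads and scheduling them onto the CPU time available on each core.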
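One way to check the creation-cost claim on a given machine is to time tasks that do no work at all, so that nearly all of the elapsed time is launch and teardown overhead. The following is only a sketch: the helper names (elapsedMicros, timeEmptyThreads, timeEmptyAsync) are made up for illustration, and the numbers will depend on the compiler, standard library and operating system.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: runs `launchAll` once and returns the elapsed wall time in microseconds.
template <typename F>
long long elapsedMicros(F&& launchAll)
{
    const auto start = std::chrono::steady_clock::now();
    launchAll();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Starts and joins `count` threads with an empty body.
long long timeEmptyThreads(std::size_t count)
{
    return elapsedMicros([count] {
        std::vector<std::thread> threads;
        threads.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            threads.emplace_back([] {});
        for (auto& t : threads)
            t.join();
    });
}

// Launches `count` empty tasks through std::async and waits for all of them.
long long timeEmptyAsync(std::size_t count)
{
    return elapsedMicros([count] {
        std::vector<std::future<void>> futures;
        futures.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            futures.push_back(std::async(std::launch::async, [] {}));
        for (auto& f : futures)
            f.get();
    });
}
```

Calling timeEmptyThreads(200) and timeEmptyAsync(200) and dividing each result by 200 gives a rough per-task overhead. If the implementation pools threads for std::async, the second figure should come out noticeably smaller.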
UPDATE:

To show that std::async uses far fewer threads than std::thread, I modified the test code to count the number of unique thread IDs used in each case, as shown below. On my PC the output is:
Number of threads used running std::threads = 200
Number of threads used to run std::async = 4

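The number of threads used by std::async on my PC varies between 2 and 4, though. This basically means that std::async reuses threads instead of creating a new one every time. Curiously, if I increase the lambda's computation time by replacing the 100 iterations in the for loop with 1000000, the number of async threads increases to 9, whereas raw threads always give 200. It is worth keeping in mind that "once a thread has finished, the value of std::thread::id may be reused by another thread".

Here is the test code: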
#include <atomic>
#include <vector>
#include <future>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <iostream>

int main()
{
    std::atomic<int> someCount = 0;
    const int THREADS = 200;
    std::vector<std::thread> threadVec(THREADS);
    std::vector<std::future<void>> futureVec(THREADS);

    std::unordered_set<std::thread::id> uniqueThreadIdsAsync;
    std::unordered_set<std::thread::id> uniqueThreadsIdsThreads;
    std::mutex mutex;

    auto lam = [&](bool isAsync)
    {
        for (int i = 0; i < 100; ++i)
            someCount++;

        auto threadId = std::this_thread::get_id();
        if (isAsync)
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadIdsAsync.insert(threadId);
        }
        else
        {
            std::lock_guard<std::mutex> lg(mutex);
            uniqueThreadsIdsThreads.insert(threadId);
        }
    };

    for (int i = 0; i < THREADS; ++i)
        threadVec[i] = std::thread(lam, false); 

    for (int i = 0; i < THREADS; ++i)
        threadVec[i].join();
    std::cout << "Number of threads used running std::threads = " << uniqueThreadsIdsThreads.size() << std::endl;

    for (int i = 0; i < THREADS; ++i)
        futureVec[i] = std::async(lam, true); // no launch policy given: the implementation may run the task asynchronously or defer it
    for (int i = 0; i < THREADS; ++i)
        futureVec[i].get();
    std::cout << "Number of threads used to run std::async = " << uniqueThreadIdsAsync.size() << std::endl;
}

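Since all threads try to update the same atomic someCount, the performance degradation could also be linked to contention (the atomic ensures that all concurrent accesses are sequentially ordered). The consequence could be that:

the threads spend time waiting,
but they consume CPU cycles anyway,
so your system throughput is wasted.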
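To illustrate the contention point, here is a minimal sketch of two ways to count. The function names countShared and countLocalThenMerge are hypothetical and not part of the original test. The first hits the shared atomic on every increment, as the benchmark does; the second lets each thread count in a local variable and touch the atomic only once.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Every increment goes to the shared atomic, so all threads contend for it.
int countShared(std::size_t threads, int perThread)
{
    std::atomic<int> total{0};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back([&total, perThread] {
            for (int i = 0; i < perThread; ++i)
                ++total;
        });
    for (auto& th : pool)
        th.join();
    return total.load();
}

// Each thread accumulates privately and publishes its result with a single atomic add.
int countLocalThenMerge(std::size_t threads, int perThread)
{
    std::atomic<int> total{0};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back([&total, perThread] {
            int local = 0;
            for (int i = 0; i < perThread; ++i)
                ++local;
            total.fetch_add(local, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join(); // join() makes every thread's fetch_add visible to the load below
    return total.load();
}
```

Both functions return the same total. The second mostly removes synchronization from the measured work, so what remains is closer to pure thread start-up cost. An optimizer may also collapse the local loop, which is worth keeping in mind when comparing timings.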

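For async(), the standard says the task runs "as if in a new thread of execution represented by a thread object ...". It does not say that this must be a dedicated thread (so it can be, but does not have to be, a thread pool). Another hypothesis is that the implementation uses more relaxed scheduling, because nothing says the thread has to start executing immediately (the constraint is only that it has executed before get() returns).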
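How much freedom the launch policy leaves can be seen with a small sketch. It assumes nothing beyond the standard library, and the helper name ranOnCallingThread is invented for illustration. It reports whether a task ended up running on the thread that called get().

```cpp
#include <future>
#include <thread>

// Returns true if a task launched with `policy` executed on the calling thread.
bool ranOnCallingThread(std::launch policy)
{
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> result =
        std::async(policy, [] { return std::this_thread::get_id(); });
    return result.get() == caller; // a deferred task only runs here, inside get()
}
```

With std::launch::deferred the task is executed lazily by the thread that calls get(), so the function returns true. With std::launch::async it runs on some other thread and the function returns false. When no policy is passed, the implementation is free to choose either behavior.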
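Recommendation

Benchmarking should be done with separation of concerns in mind. To measure multithreading performance, inter-thread synchronization should therefore be avoided as far as possible.

Keep in mind that if more than std::thread::hardware_concurrency() threads are active, there is no true concurrency anymore and the OS has to manage the overhead of context switching.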
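A common way to follow both recommendations is to start no more workers than hardware_concurrency() reports and to give each worker its own output slot so that no synchronization is needed while the work runs. The sketch below assumes a simple summation workload; workerCount and parallelSum are hypothetical names, not something from the original benchmark.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// hardware_concurrency() may return 0 when the value is unknown; fall back to one worker.
std::size_t workerCount()
{
    const unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1u : hw;
}

// Sums `data` with at most workerCount() threads, one contiguous chunk per thread.
long long parallelSum(const std::vector<int>& data)
{
    const std::size_t workers = std::min(workerCount(), std::max<std::size_t>(data.size(), 1));
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    std::vector<long long> partial(workers, 0); // one slot per worker, no shared counter
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w)
    {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        pool.emplace_back([&data, &partial, w, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[w] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& t : pool)
        t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The only synchronization left is the final join(), which keeps thread start-up cost and useful work easier to tell apart in a benchmark.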
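Edit: some experimental feedback (2)

With a loop of 100 in lam, the benchmark results I measured are not usable, because the error introduced by the 15 ms Windows clock resolution is too large.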
Test case            Thread      Async 
   10 000 loop          78          31
1 000 000 loop        2743        2670    (the longer the work, the smaller the difference)
   10 000 + yield()    500        1296    (many more context switches) 

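When the number of THREADS is increased, the timings grow proportionally, but only for the test cases with short work. This suggests that the observed difference is indeed related to the overhead of creating threads.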
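When a single run is shorter than the clock resolution, as with the loop of 100 mentioned above, one workaround is to repeat the workload until the total elapsed time is well above the resolution and then report the average. The following is a sketch under that assumption; Timing and timeRepeated are illustrative names only.

```cpp
#include <chrono>
#include <cstddef>

// Result of a repeated measurement: how many runs were made and the mean time per run.
struct Timing
{
    std::size_t runs;
    double millisPerRun;
};

// Calls `work` repeatedly until at least `minTotal` has elapsed, then averages.
template <typename F>
Timing timeRepeated(F&& work, std::chrono::milliseconds minTotal)
{
    using clock = std::chrono::steady_clock;
    std::size_t runs = 0;
    const auto start = clock::now();
    auto elapsed = clock::duration::zero();
    do
    {
        work();
        ++runs;
        elapsed = clock::now() - start;
    } while (elapsed < minTotal);
    const double totalMs = std::chrono::duration<double, std::milli>(elapsed).count();
    return {runs, totalMs / static_cast<double>(runs)};
}
```

For example, passing the thread-based launch loop as `work` with a minimum total of one second keeps a 15 ms tick small relative to the measured interval. Each repetition still includes its own thread creation, so the average reflects the overhead under study.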