C++: filling a vector with std::async is slower than the non-async approach


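I am experimenting with std::async to populate a vector. The idea is to use multiple threads to save time. However, after running some benchmarks, I found that my non-async approach is faster.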

#include <algorithm>
#include <vector>
#include <future>

std::vector<int> Generate(int i)
{
    std::vector<int> v;
    for (int j = i; j < i + 10; ++j)
    {
        v.push_back(j);
    }
    return v;
}

My benchmark shows that the async approach is 71 times slower than the non-async approach. What am I doing wrong?
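First, you are not forcing std::async to run asynchronously (you need to specify the std::launch::async policy for that). Second, asynchronously creating a std::vector of 10 ints is overkill; it is just not worth it. Remember that using more threads does not mean you will see a performance benefit! Creating threads (or even using a thread pool) introduces overhead, which in this case seems to dwarf the benefit of running the tasks asynchronously.
Thanks, @NathanOliver ;>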
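std::async has two modes of operation:
std::launch::async
std::launch::deferred
In this case you called std::async without specifying either one, which means the implementation is allowed to choose either. std::launch::deferred basically means the work is done on the calling thread. std::async returns a future, and with std::launch::deferred the action you requested is not carried out until you call .get() on that future. That can be handy in a few cases, but it is probably not what you want here.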
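Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you requested. It then has to create a future and use some sort of signalling from the thread to the future to let you know when the requested computation is finished.
All of that adds a fair amount of overhead: anywhere from microseconds to milliseconds or so, depending on the OS, CPU, and so on.
So for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take at least tens of milliseconds (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
So filling an array asynchronously probably only makes sense if the array is quite a lot larger than the one you are dealing with here.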
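For filling memory, however, you quickly run into a different problem: most CPUs are much faster than main memory, so if all you are doing is writing to memory, chances are pretty good that a single thread already saturates the path to memory. Doing the job asynchronously will then gain only a little at best, and can still quite easily cause a slowdown.
The ideal case for asynchronous operation is one thread that is heavily memory bound and another that (for example) reads a small amount of data and does a lot of computation on it. In that case the computation thread operates mostly on data in its cache, so it does not get in the way of the memory thread doing its job.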
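Multiple factors are causing the multithreaded code to perform (much) slower than the single-threaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on particularly small datasets. In both versions of your code you generate 200 integers in total, and each logical thread (which, because std::async is often implemented in terms of a thread pool, might not be the same as a software thread) generates only 10 of them. The cost of spooling up a thread for every 10 integers outweighs the benefit of generating those integers in parallel.
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers, but you would probably run into a different problem:
All of your code is bottlenecked on an inherently serial process
Both versions of the code copy the generated integers into the host vector. It would be one thing if generating those integers were itself a time-consuming process, but in your case it is likely just a small, fast bit of assembly per integer.
So the act of copying each integer into the final vector is probably no faster than generating it, which means that a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.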
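The effect of a serial portion is usually summarised by Amdahl's law. As a worked sketch, the parallel fraction p = 0.5 and the thread count n = 2 below are assumed values chosen for illustration, not measurements from this code:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
\qquad\Rightarrow\qquad
S(2)\big|_{p = 0.5} = \frac{1}{0.5 + 0.25} = \frac{4}{3} \approx 1.33,
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 2
```

Under those assumptions, two threads give at most a 1.33x speedup before any thread-launch overhead is subtracted, and no number of threads can exceed 2x.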
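Fixing the code
Compilers are very good at their job, so in trying to revise your code I was only barely able to get multithreaded code that was faster than the serial code. Results varied between runs, so my general assessment is that this kind of code is poorly suited to multithreading.
But here is what I came up with: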
#include <algorithm>
#include <chrono>
#include <future>
#include <iomanip>
#include <iostream>
#include <thread>   // std::thread::hardware_concurrency
#include <vector>

//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;

std::vector<int> Generate(int i) {
    std::vector<int> v;
    for (int j = i; j < i + BLOCK_SIZE; ++j) {
        v.push_back(j);
    }
    return v;
}

void asynchronous_attempt() {
    std::vector<std::future<void>> futures;
    //#2: Preallocated Vector
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE)
    {
      futures.push_back(std::async(
        [it](int i) { 
            auto vec = Generate(i); 
            //#3 Copying done multithreaded
            std::copy(vec.begin(), vec.end(), it + i);
        }, i));
    }
    
    for (auto &&f : futures) {
        f.get();
    }
}

void serial_attempt() {
    //#4 Changes here to show fair comparison
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i+=BLOCK_SIZE) {
        auto vec = Generate(i);
        it = std::copy(vec.begin(), vec.end(), it);
    }
}

int main() {
    using clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
    auto begin = clock::now();
    asynchronous_attempt();
    auto end = clock::now();
    // Cast explicitly: the tick period of steady_clock is implementation-defined.
    std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << duration_cast<nanoseconds>(end - begin).count() << "ns" << std::endl;
    begin = clock::now();
    serial_attempt();
    end = clock::now();
    std::cout << "Duration of Serial Attempt:        " << std::setw(10) << duration_cast<nanoseconds>(end - begin).count() << "ns" << std::endl;
}

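Given that this was run on an online compiler, I would wager that the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates the improvement: the two approaches are now at least on par.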
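Here are the changes I made, keyed to the IDs in the code:
#1: The number of generated integers has been increased significantly, to force the threads to do actual meaningful work instead of getting bogged down in OS-level housekeeping.
#2: The vector's size is preallocated, so it is no longer resized repeatedly.
#3: Now that the space is preallocated, the copying can be done by the worker tasks instead of serially afterwards.
#4: The serial code had to be changed to preallocate and copy as well, to keep the comparison fair.
This ensures that all of the work can now run in parallel, although it does not amount to a substantial improvement.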
For reference, the non-async loop with the question's 10-element Generate:
std::vector<int> res;
for (int i = 0; i < 200; i+=10)
{
   auto vec = Generate(i);
   res.insert(std::end(res), std::begin(vec), std::end(vec));
}


Output of the revised program:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt:  361149213ns
Duration of Serial Attempt:         364785676ns