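Optimizing square matrix multiplication with std::thread

I am trying to implement matrix multiplication in C++ using std::thread. Currently, my kernel code looks like this:

    // threadCount is not a parameter; it is assumed to be defined elsewhere (e.g. as a global).
    void multiply(const int* a, const int* b, int* c, int rowLength, int start)
    {
        // Each thread starts at row `start` and then skips ahead by threadCount rows.
        for (auto i = start; i < rowLength; i += threadCount)
        {
            const auto rowI = i * rowLength;
            for (auto j = 0; j < rowLength; j++)
            {
                auto result = 0;
                const auto rowJ = j * rowLength;  // b is already transposed, so this is a row of b
                for (auto k = 0; k < rowLength; k++)
                {
                    result += a[rowI + k] * b[rowJ + k];
                }
                c[rowI + j] = result;
            }
        }
    }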

c++

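As you can see, I multiply matrix A by matrix B, which has already been transposed (the transposition is done while reading the input). For now, I am using a one-dimensional (flat array) approach. Are there any optimizations I can make to my current code?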

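You could use threads as the title says, but your code does not actually do that. Locality will also play a big role. Also, plan wisely, because threads may not help unless the matrices are very large (a detail you did not mention either).

Yes, the matrices are very large. I forgot to mention that this function is called with std::thread(), so there are threads. For example, I use N threads, so it is called in a loop for i = 0…N, where i is the start variable. Each thread then advances by threadCount (the second line of the code above).

The outer loop will reduce the effectiveness of hardware prefetching, because it is non-unit stride. You would be better off peeling off threadCount blocks, each of size rowLength / threadCount. Since this is integer matrix multiplication, most modern chips will not give you multiple-issue or ternary instructions (i.e. muladd), so do not expect to see anything close to the chip's floating-point peak rating. What percentage of the integer peak are you getting? Can the problem be converted to float or double? What is your target processor? What does the cache hierarchy look like?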
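
The block partitioning suggested in the last comment could look roughly like the sketch below. It is only an illustration under a few assumptions: the names multiplyBlock and multiplyParallel are hypothetical (they do not appear in the question), threadCount is passed as a parameter instead of being read from a global, and b is assumed to be already transposed, as in the question. Each thread gets one contiguous range of rows rather than every threadCount-th row. Whether this is actually faster depends on the target processor and cache hierarchy, so it should be measured.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical kernel: computes rows [rowBegin, rowEnd) of c.
// b is assumed to be stored transposed, so both inner reads are unit-stride.
void multiplyBlock(const int* a, const int* b, int* c,
                   int rowLength, int rowBegin, int rowEnd)
{
    for (int i = rowBegin; i < rowEnd; ++i)
    {
        const int rowI = i * rowLength;
        for (int j = 0; j < rowLength; ++j)
        {
            const int rowJ = j * rowLength;
            int result = 0;
            for (int k = 0; k < rowLength; ++k)
            {
                result += a[rowI + k] * b[rowJ + k];
            }
            c[rowI + j] = result;
        }
    }
}

// Hypothetical driver: splits the rows into threadCount contiguous blocks
// of roughly rowLength / threadCount rows and runs one std::thread per block.
void multiplyParallel(const int* a, const int* b, int* c,
                      int rowLength, int threadCount)
{
    // Never start more threads than there are rows, and always at least one.
    threadCount = std::max(1, std::min(threadCount, rowLength));
    const int base = rowLength / threadCount;   // rows every thread gets
    const int extra = rowLength % threadCount;  // first `extra` threads get one more

    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    int rowBegin = 0;
    for (int t = 0; t < threadCount; ++t)
    {
        const int rowEnd = rowBegin + base + (t < extra ? 1 : 0);
        workers.emplace_back(multiplyBlock, a, b, c, rowLength, rowBegin, rowEnd);
        rowBegin = rowEnd;
    }
    for (auto& worker : workers)
    {
        worker.join();  // wait until every block of c has been written
    }
}
```

A call such as multiplyParallel(a, b, c, rowLength, std::thread::hardware_concurrency()) would be one reasonable starting point. Each thread writes a disjoint set of rows of c, so no locking is needed.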