Why doesn't numpy.dot parallelize on ndarrays with more than 2 dimensions?


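I noticed an interesting behavior of the numpy.dot() function. My Enterprise RedHat 6.7 box has 2 Xeon CPUs with 12 cores each. I ran the following code snippets and watched the CPU usage in htop.

The following code uses all the cores on my server: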

import numpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 5)
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

Edit: Below is a screenshot of htop while running the code above.

However, as soon as I add a dimension to b as shown below, only one core is used:

import numpy as np
a = np.random.rand(1000, 1000)
b = np.random.rand(1, 1000, 5) #or np.random.rand(n, 1000, 5) where n>=1
z = a.dot(b) #or use %timeit a.dot(b) if you use ipython

Edit: Below is a screenshot of htop while running the code above.
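The two cases can also be compared side by side without htop. The following is a minimal sketch that is not part of the original post: it reuses the same array shapes, checks that both calls compute the same numbers, and times them. The names b2d, b3d, t2d and t3d are illustrative, and the timings depend on the machine and on the BLAS library numpy was built against.

```python
import timeit

import numpy as np

a = np.random.rand(1000, 1000)
b2d = np.random.rand(1000, 5)
b3d = b2d[np.newaxis, :, :]          # same data with a leading axis: (1, 1000, 5)

z2d = a.dot(b2d)                     # shape (1000, 5)
z3d = a.dot(b3d)                     # shape (1000, 1, 5)

# Both calls compute the same values; only the output layout differs.
same_values = np.allclose(z2d, z3d[:, 0, :])

# Time each variant; a large gap suggests different code paths are taken.
t2d = timeit.timeit(lambda: a.dot(b2d), number=20)
t3d = timeit.timeit(lambda: a.dot(b3d), number=20)
print("same values:", same_values)
print("2-D operand: %.4f s, 3-D operand: %.4f s" % (t2d, t3d))
```

If the 3-D call is much slower even though the arithmetic is identical, that is consistent with the single-core behavior seen in htop.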

Below is my Python environment, from import sys; sys.version:

'2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]'

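Below is the configuration information from numpy.show_config():

Has anyone seen this before? I tend to think this is a bug rather than by design, because there is clearly more work to do with one more dimension. Also, is there a way to force numpy.dot to parallelize?

Thanks in advance.

Update:
I found a way to speed up the computation. See the code snippet below: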
import numpy as np
a = np.random.rand(1000, 1000) #in my program a variable
b = np.random.rand(100, 1000, 5) #b is a constant
z1 = a.dot(b)
c=b.swapaxes(0, 1).reshape(1000, 5*100) #the trick is to turn the 3d array into a 2d matrix 
z2 = a.dot(c).reshape(z1.shape) #then reshape the result to the desired shape.
np.allclose(z1, z2) #the results are identical but the computation of z2 is more than 10 times faster than that of z1 on my server. 

However, I agree that in the long run we should look into the numpy code, as @hpaulj suggested, and fix this once and for all (if it is a bug).
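The reshape trick from the update can be wrapped in a small helper so that it works for any N-D second operand. This is a sketch under assumptions, not the poster's code: dot_via_2d is a hypothetical name, and it assumes a is 2-D and b has at least 2 dimensions. np.tensordot is shown next to it because it produces the same result and, as far as I know, also reduces the contraction to a 2-D matrix product internally.

```python
import numpy as np


def dot_via_2d(a, b):
    """Compute a.dot(b) for 2-D a and N-D b using one 2-D matrix product."""
    k = b.shape[-2]                       # axis that a.dot(b) contracts over
    b_front = np.moveaxis(b, -2, 0)       # bring the contracted axis to the front
    rest = b_front.shape[1:]              # remaining axes of b, in original order
    c = b_front.reshape(k, -1)            # flatten them into a single 2-D operand
    return a.dot(c).reshape(a.shape[:-1] + rest)


a = np.random.rand(50, 40)
b = np.random.rand(7, 40, 3)

z_ref = a.dot(b)                                        # generic N-D path
z_2d = dot_via_2d(a, b)                                 # single 2-D product
z_td = np.tensordot(a, b, axes=([1], [b.ndim - 2]))     # same contraction

print(z_ref.shape, np.allclose(z_ref, z_2d), np.allclose(z_ref, z_td))
```

The helper pays for one copy of b when it reshapes the transposed view, which is usually small compared with the cost of the product itself. If b is constant, as in the update, the reshaped operand can be computed once and reused.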
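I think you have to study the C source code, for example cblas_matrixproduct. There is a lot of code that checks the dimensions of the 2 input arrays. Toward the end there is a part that handles matrix * matrix multiplication: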
(PyArray_NDIM(ap1) == 2 && PyArray_NDIM(ap2) == 2)

The computational core appears to be wrapped in NPY_BEGIN_ALLOW_THREADS and NPY_END_ALLOW_THREADS.

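Your MKL code is probably a drop-in replacement for BLAS.

Now the trick is to find where 3d arrays are handled. Somehow it operates on slices, so that the BLAS code still sees a 2d array.

I am guessing that the use of multiple cores happens in the BLAS/MKL code, not in the numpy code. In other words, the numpy code says (to the compiler) "it is OK to use threads and/or cores here", but not "here is how to divide the work among cores based on the array dimensions".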
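One way to probe this guess is to cap the BLAS thread count and see which call slows down. The sketch below is an assumption-laden experiment, not something from the thread: it sets the thread-count environment variables commonly honored by MKL, OpenMP and OpenBLAS before numpy is imported, since they are typically read when the library loads. Which variable has an effect depends on the BLAS that numpy was built against.

```python
import os

# Assumption: the variables must be set before numpy (and its BLAS) is loaded.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import timeit

import numpy as np

a = np.random.rand(1000, 1000)
b2d = np.random.rand(1000, 100)      # 2-D operand: eligible for the BLAS path
b3d = np.random.rand(20, 1000, 5)    # 3-D operand with the same number of elements

t2d = timeit.timeit(lambda: a.dot(b2d), number=3)
t3d = timeit.timeit(lambda: a.dot(b3d), number=3)
print("capped to 1 thread: 2-D %.4f s, 3-D %.4f s" % (t2d, t3d))
```

Running the script once with the cap and once with the loop over the variables removed gives two pairs of timings. If only the 2-D timing changes between the runs, the multi-core work is being done inside the BLAS library rather than by numpy.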

PyArray_MatrixProduct2 appears to be the function that decides how to call the BLAS dot function I found earlier.

The case of 2 2d matrices appears to be handled by:
#if defined(HAVE_CBLAS)
if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
        (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
         NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
    return cblas_matrixproduct(typenum, ap1, ap2, out);
}

Everything else appears to fall through to a generic iterator loop:

NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(ap2));
while (it1->index < it1->size) {
    while (it2->index < it2->size) {
        dot(it1->dataptr, is1, it2->dataptr, is2, op, l, ret);
        op += os;
        PyArray_ITER_NEXT(it2);
    }
    PyArray_ITER_NEXT(it1);
    PyArray_ITER_RESET(it2);
}
NPY_END_THREADS_DESCR(PyArray_DESCR(ap2));

where dot = PyArray_DESCR(ret)->f->dotfunc is defined according to the dtype.

I am not sure I have answered your question, but it is clear that the code is complex, and that simple reasoning about how you or I would divide up the task does not apply.

Please share your htop check!

Thank you for the prompt and insightful answer. I think I will need to spend quite some time reading the code to find the exact problem, even though you have narrowed the search considerably. I have found a workaround for my immediate need, but in the long run understanding the C code will definitely be of great value.
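To make the two C paths quoted in the answer easier to follow, here is a rough Python emulation. It is only a sketch based on the snippets above, not numpy's actual implementation: takes_blas_path and fallback_dot are hypothetical names, the dtype tuple mirrors NPY_FLOAT, NPY_DOUBLE, NPY_CFLOAT and NPY_CDOUBLE, and the inner np.dot on two 1-D vectors stands in for the per-dtype dotfunc.

```python
import numpy as np

# dtypes corresponding to NPY_FLOAT, NPY_DOUBLE, NPY_CFLOAT, NPY_CDOUBLE
BLAS_DTYPES = (np.float32, np.float64, np.complex64, np.complex128)


def takes_blas_path(a, b):
    """Mirror the quoted condition: both operands at most 2-D, BLAS dtype."""
    dtype = np.result_type(a, b)
    return a.ndim <= 2 and b.ndim <= 2 and dtype.type in BLAS_DTYPES


def fallback_dot(a, b):
    """Emulate the generic loop: one 1-D dot product per output element."""
    k = a.shape[-1]
    a_vecs = a.reshape(-1, k)                         # vectors along a's last axis
    b_vecs = np.moveaxis(b, -2, -1).reshape(-1, k)    # vectors along b's 2nd-to-last axis
    out = np.empty((len(a_vecs), len(b_vecs)), dtype=np.result_type(a, b))
    for i, va in enumerate(a_vecs):                   # outer iterator (it1)
        for j, vb in enumerate(b_vecs):               # inner iterator (it2)
            out[i, j] = np.dot(va, vb)                # stands in for dotfunc
    return out.reshape(a.shape[:-1] + b.shape[:-2] + b.shape[-1:])


a = np.random.rand(4, 6)
b = np.random.rand(3, 6, 2)
print(takes_blas_path(a, b[0]), takes_blas_path(a, b))
print(np.allclose(fallback_dot(a, b), a.dot(b)))
```

Seen this way, the fallback issues many small vector products instead of one large matrix product, which would explain why a threaded BLAS has nothing to spread across cores in the N-D case.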