Performance of NumPy with different BLAS implementations

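I am running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm is solving a set of linear systems (i.e. a call to numpy.linalg.solve()). I came up with this small benchmark: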

import numpy as np
import time

# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)

t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)

I have been running this on:

  • My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2 GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
  • An Amazon EC2 c3.xlarge instance with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"
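
Bottom line:

  • On the Mac it runs in about 4.5 seconds
  • On the EC2 instance it runs in about 19.5 seconds

I have also tried it on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware configuration).

Can anyone explain why the performance on the Mac (with the Accelerate framework) is more than 4x better? More details about the NumPy / BLAS setup on each machine are provided below.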

Laptop setup

numpy.show_config() gives me:

    atlas_threads_info:
      NOT AVAILABLE
    blas_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
        define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_blas_threads_info:
      NOT AVAILABLE
    openblas_info:
      NOT AVAILABLE
    lapack_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3']
        define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_info:
      NOT AVAILABLE
    lapack_mkl_info:
      NOT AVAILABLE
    blas_mkl_info:
      NOT AVAILABLE
    atlas_blas_info:
      NOT AVAILABLE
    mkl_info:
      NOT AVAILABLE
    
EC2 instance setup

On Ubuntu 14.04, I installed OpenBLAS with

    sudo apt-get install libopenblas-base libopenblas-dev
    
While installing NumPy, I created a site.cfg with the following contents:

    [default]
    library_dirs= /usr/lib/openblas-base
    
    [atlas]
    atlas_libs = openblas
    
numpy.show_config() gives me:

    atlas_threads_info:
        libraries = ['lapack', 'openblas']
        library_dirs = ['/usr/lib']
        define_macros = [('ATLAS_INFO', '"\\"None\\""')]
        language = f77
        include_dirs = ['/usr/include/atlas']
    blas_opt_info:
        libraries = ['openblas']
        library_dirs = ['/usr/lib']
        language = f77
    openblas_info:
        libraries = ['openblas']
        library_dirs = ['/usr/lib']
        language = f77
    lapack_opt_info:
        libraries = ['lapack', 'openblas']
        library_dirs = ['/usr/lib']
        define_macros = [('ATLAS_INFO', '"\\"None\\""')]
        language = f77
        include_dirs = ['/usr/include/atlas']
    openblas_lapack_info:
      NOT AVAILABLE
    lapack_mkl_info:
      NOT AVAILABLE
    blas_mkl_info:
      NOT AVAILABLE
    mkl_info:
      NOT AVAILABLE
    

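The reason for this behavior could be that Accelerate uses multithreading, while the others don't.

Most BLAS implementations follow the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they use only one thread if not told otherwise explicitly. For Accelerate, however, it sounds like threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS.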

To determine whether this is really what is happening, try

    export VECLIB_MAXIMUM_THREADS=1
    
before calling the Accelerate version, and

    export OMP_NUM_THREADS=4
    
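for the other versions.

Independent of whether this is really the reason, it is a good idea to always set these variables when you use BLAS, so that you control what is going on.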
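
The experiment above could be scripted roughly as follows. This is only a sketch: the helper names thread_env and run_benchmark are made up for illustration, and the matrix size is shrunk to 1000 x 1000 so that it finishes quickly. The benchmark runs in a fresh interpreter because BLAS libraries typically read these variables when they are loaded, so changing them after NumPy has been imported may have no effect.

```python
import os
import subprocess
import sys

# The benchmark from the question, shrunk to 1000 x 1000 (an arbitrary
# size chosen for this sketch) so that it finishes quickly.
BENCHMARK = """
import time
import numpy as np
a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
t1 = time.time()
np.linalg.solve(a, b)
print(time.time() - t1)
"""


def thread_env(num_threads):
    """Return a copy of the current environment with both thread limits set."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)         # OpenMP-style BLAS builds
    env["VECLIB_MAXIMUM_THREADS"] = str(num_threads)  # Accelerate / vecLib
    return env


def run_benchmark(num_threads):
    """Run the benchmark in a fresh interpreter and return elapsed seconds."""
    out = subprocess.check_output(
        [sys.executable, "-c", BENCHMARK], env=thread_env(num_threads)
    )
    return float(out.decode().strip())
```

Comparing run_benchmark(1) with run_benchmark(4) on each machine should show whether the thread count explains the gap: if the Mac timing with one thread moves close to the EC2 timing, threading is the likely cause.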

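Haswell's raw compute per core per cycle is 2x that of Ivy Bridge (due to the inclusion of FMA). I wonder if your OpenBLAS was built without AVX support enabled? That would account for another factor of 2. This sounds like it may be related.

Can you check whether your EC2 instance is actually handling BLAS operations with multiple threads?

Linked against Accelerate, VECLIB_MAXIMUM_THREADS does affect the performance of numpy.linalg.norm. scipy.linalg.norm, on the other hand, is consistently slower and unaffected by the variable, which leads me to believe that it is not linked against Accelerate but uses the reference LAPACK.

Thanks, Elmar. FWIW, the source of norm has "if ord in (None, 2) and (a.ndim == 1): nrm2 = get_blas_funcs('nrm2')" and says "# Immediately handle some default, simple, fast, and common cases."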
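
One rough way to check whether a BLAS call is really using several threads is to compare CPU time with wall-clock time. time.process_time() adds up the CPU time of all threads in the process, so a ratio well above 1 suggests that more than one core was busy. The helper name cpu_to_wall_ratio is made up for this sketch, and it needs Python 3.3 or later.

```python
import time


def cpu_to_wall_ratio(func):
    """Run func() once and return (CPU seconds) / (wall-clock seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall
```

With the matrices from the question, cpu_to_wall_ratio(lambda: np.linalg.solve(a, b)) close to 1 would point to a single-threaded BLAS, while a value close to 4 on a 4-core machine would point to all cores being used.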