How to speed up scipy map_coordinates for multiple interpolations? (Python, NumPy, SciPy)

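I have several quantities f, g, h defined on the same regular grid (x, y, z), and I want to interpolate them onto a new grid (x1, y1, z1). That is, I have f(x, y, z), g(x, y, z), h(x, y, z) and I want to compute f(x1, y1, z1), g(x1, y1, z1), h(x1, y1, z1).
At the moment I am doing this with scipy.ndimage.map_coordinates. However, each interpolation is done separately, and the number of points is about 4,000,000, so it is quite slow.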
from scipy.ndimage import map_coordinates
import numpy as np

# examples of f, g, h
f=np.random.randn(100,50,50)
g=np.random.randn(100,50,50)
h=np.random.randn(100,50,50)

# examples of x1, y1, z1
x1=np.random.rand(4000000)*100
y1=np.random.rand(4000000)*50
z1=np.random.rand(4000000)*50

# my solution at the moment
coords=np.array([x1,y1,z1])

out = np.zeros((3, coords.shape[1]))
out[0]= map_coordinates(f, coords, order=1)
out[1]= map_coordinates(g, coords, order=1)
out[2]= map_coordinates(h, coords, order=1)

Is there any way to speed up the computation?
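I gave it a try, but unfortunately it does not beat scipy's map_coordinates function. On my ordinary laptop, the three calls to map_coordinates together take about 1.0 s, i.e. about 80 ns per coordinate tuple per array. At roughly 300 clock cycles (3.7 GHz CPU) that sounds like a lot, but it turns out there is quite a lot of work to do.
Part of the work is splitting the floating-point coordinates into integer and fractional parts. This part only has to be done once for the three input arrays f, g, and h. Unfortunately, it takes only about 100 ms, so sharing it saves little. The rest is simply a large number of multiplications and additions.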
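As a minimal sketch of the shared step described above (plain NumPy, not the answer's Numba code), the coordinates can be split once and the result reused for every input array. The sample coordinates and the names `ijk` and `frac` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample coordinates, shape (3, n), scaled to stay inside a (100, 50, 50) grid
coords = rng.random((3, 5)) * np.array([[99.0], [49.0], [49.0]])

# Integer part: index of the lower grid corner along each axis
ijk = np.floor(coords).astype(np.intp)
# Fractional part: distance from the lower corner, always in [0, 1)
frac = coords - ijk

# ijk and frac depend only on the coordinates, so they can be reused for f, g and h
```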
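I implemented it with Numba JIT-compiled code and took care to lay out the arrays in memory so that cache access is reasonably efficient, but it still runs 1.3x slower than scipy.ndimage.map_coordinates. (Edit: max9111 provided a significant improvement in a separate answer.)
I changed your coords initialization so that no out-of-bounds handling is needed: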
n = 4000_000
x1=np.random.rand(n)*99
y1=np.random.rand(n)*49
z1=np.random.rand(n)*49

Implementation:
from numba import njit

@njit(fastmath=True)
def mymap(ars, coords):
    """ars is input arrays, shape (m, nx, ny, nz)
    coords is coordinate array, float, shape (3, n)
    """
    # these have shape (n, 3)
    ijk = coords.T.astype(np.int16).copy() # copy for memory layout
    fijk = (coords.T - ijk).astype(np.float32)
    n = ijk.shape[0]
    m = ars.shape[0]
    out = np.empty((n, m), dtype=np.float64)
    
    for l in range(n):
        i0, j0, k0 = ijk[l, :3]
        # Note: don't write i1, j1, k1 = ijk[l, :3]+1 -- much slower.
        i1, j1, k1 = i0+1, j0+1, k0+1
        fi1, fj1, fk1 = fijk[l, :3]
        fi0, fj0, fk0 = 1-fi1, 1-fj1, 1-fk1
        out[l, :] = (
            fi0 * fj0 * fk0 * ars[:, i0, j0, k0] +
            fi0 * fj0 * fk1 * ars[:, i0, j0, k1] +
            fi0 * fj1 * fk0 * ars[:, i0, j1, k0] +
            fi0 * fj1 * fk1 * ars[:, i0, j1, k1] +
            fi1 * fj0 * fk0 * ars[:, i1, j0, k0] +
            fi1 * fj0 * fk1 * ars[:, i1, j0, k1] +
            fi1 * fj1 * fk0 * ars[:, i1, j1, k0] +
            fi1 * fj1 * fk1 * ars[:, i1, j1, k1]
            )
    return out.T

fgh = np.array([f, g, h]).T.copy().T # optimize memory layout
out = mymap(fgh, coords)

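There are 24 floating-point multiplications and 7 floating-point additions per coordinate tuple and per input array. In addition, there is a lot of array indexing, which requires integer multiplications. The amount of arithmetic that can be shared between the input arrays is fairly small.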
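To make the eight-corner sum easier to check, here is a sketch of the same trilinear formula written in vectorized NumPy rather than Numba. It is not the answer's implementation and is not claimed to be fast. The function name `trilinear_stack` and the small test grid are assumptions. The check relies on the fact that trilinear interpolation reproduces affine functions exactly.

```python
import numpy as np

def trilinear_stack(ars, coords):
    """Trilinear interpolation of m stacked arrays at n points.

    ars: shape (m, nx, ny, nz); coords: float, shape (3, n).
    Each coordinate is assumed to lie in [0, size - 1), so no bounds handling.
    Returns an array of shape (m, n).
    """
    ijk = np.floor(coords).astype(np.intp)      # lower corner indices, (3, n)
    i0, j0, k0 = ijk
    i1, j1, k1 = i0 + 1, j0 + 1, k0 + 1         # upper corner indices
    fi1, fj1, fk1 = coords - ijk                # weights of the upper corners
    fi0, fj0, fk0 = 1.0 - fi1, 1.0 - fj1, 1.0 - fk1
    # 8 terms with 3 multiplications each, joined by 7 additions
    return (
        fi0 * fj0 * fk0 * ars[:, i0, j0, k0]
        + fi0 * fj0 * fk1 * ars[:, i0, j0, k1]
        + fi0 * fj1 * fk0 * ars[:, i0, j1, k0]
        + fi0 * fj1 * fk1 * ars[:, i0, j1, k1]
        + fi1 * fj0 * fk0 * ars[:, i1, j0, k0]
        + fi1 * fj0 * fk1 * ars[:, i1, j0, k1]
        + fi1 * fj1 * fk0 * ars[:, i1, j1, k0]
        + fi1 * fj1 * fk1 * ars[:, i1, j1, k1]
    )

# Hypothetical test data: two affine functions sampled on a 6 x 5 x 4 grid
x, y, z = np.meshgrid(np.arange(6), np.arange(5), np.arange(4), indexing="ij")
ars = np.stack([2.0 * x + 3.0 * y - z + 1.0, x - y + 0.5 * z])

rng = np.random.default_rng(0)
pts = rng.random((3, 100)) * np.array([[5.0], [4.0], [3.0]])  # interior points only

out = trilinear_stack(ars, pts)
expected = np.stack([2.0 * pts[0] + 3.0 * pts[1] - pts[2] + 1.0,
                     pts[0] - pts[1] + 0.5 * pts[2]])
print(np.allclose(out, expected))
```

The weights are computed once per point and broadcast across the leading axis of `ars`, which is the only work shared between the input arrays.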
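This is just a short comment on @Han-Kwang Nienhuys' answer.
The main thing to improve here is to avoid the vectorized commands, which can cause a rather large performance penalty.
In general, if you use the default C-ordered arrays, it would also be better to change the shape of the input and output arrays to (n, 3) instead of (3, n).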
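The layout remark can be illustrated with a small NumPy sketch. It only inspects strides and contiguity flags and makes no timing claim. The array size and variable names are assumptions.

```python
import numpy as np

coords = np.zeros((3, 1000))                 # shape (3, n), C-ordered float64
coords_t = coords.T                          # shape (n, 3), but only a transposed view
coords_c = np.ascontiguousarray(coords.T)    # shape (n, 3), copied into C order

# In the view, the three components of one point are 8000 bytes apart
print(coords_t.strides, coords_t.flags["C_CONTIGUOUS"])
# In the copy, the three components of one point are adjacent in memory
print(coords_c.strides, coords_c.flags["C_CONTIGUOUS"])
```

With the C-ordered (n, 3) copy, a loop over points reads each point's x, y, z from neighbouring memory locations.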
Input
import numpy as np
import numba as nb
from scipy.ndimage import map_coordinates

# examples of f, g, h
f=np.random.randn(100,50,50)
g=np.random.randn(100,50,50)
h=np.random.randn(100,50,50)

n=4_000_000
# examples of x1, y1, z1
x1=np.random.rand(n)*99
y1=np.random.rand(n)*49
z1=np.random.rand(n)*49

coords=np.array((x1,y1,z1))
fgh = np.array([f, g, h]).T.copy().T # optimize memory layout

Code
#from Han-Kwang Nienhuys
@nb.njit(fastmath=True)
def mymap(ars, coords):
    """ars is input arrays, shape (m, nx, ny, nz)
    coords is coordinate array, float, shape (3, n)
    """
    # these have shape (n, 3)
    ijk = coords.T.astype(np.int16)
    fijk = (coords.T - ijk).astype(np.float32)
    n = ijk.shape[0]
    m = ars.shape[0]
    out = np.empty((n, m), dtype=np.float64)

    for l in range(n):
        i0, j0, k0 = ijk[l, :3]
        # Note: don't write i1, j1, k1 = ijk[l, :3]+1 -- much slower.
        i1, j1, k1 = i0+1, j0+1, k0+1
        fi1, fj1, fk1 = fijk[l, :3]
        fi0, fj0, fk0 = 1-fi1, 1-fj1, 1-fk1
        out[l, :] = (
            fi0 * fj0 * fk0 * ars[:, i0, j0, k0] +
            fi0 * fj0 * fk1 * ars[:, i0, j0, k1] +
            fi0 * fj1 * fk0 * ars[:, i0, j1, k0] +
            fi0 * fj1 * fk1 * ars[:, i0, j1, k1] +
            fi1 * fj0 * fk0 * ars[:, i1, j0, k0] +
            fi1 * fj0 * fk1 * ars[:, i1, j0, k1] +
            fi1 * fj1 * fk0 * ars[:, i1, j1, k0] +
            fi1 * fj1 * fk1 * ars[:, i1, j1, k1]
            )
    return out.T

#optimized version (set parallel=True to run the prange loop on multiple threads)
@nb.njit(fastmath=True,parallel=False)
def mymap_opt(ars, coords):
    """ars is input arrays, shape (m, nx, ny, nz)
    coords is coordinate array, float, shape (3, n)
    """
    # these have shape (n, 3)
    ijk = coords.T.astype(np.int16)
    fijk = (coords.T - ijk).astype(np.float32)
    n = ijk.shape[0]
    m = ars.shape[0]
    out = np.empty((n, m), dtype=np.float64)

    for l in nb.prange(n):
        i0= ijk[l, 0]
        j0= ijk[l, 1]
        k0 =ijk[l, 2]
        # Note: don't write i1, j1, k1 = ijk[l, :3]+1 -- much slower.
        i1, j1, k1 = i0+1, j0+1, k0+1
        fi1=  fijk[l, 0] 
        fj1=  fijk[l, 1] 
        fk1 = fijk[l, 2]

        fi0, fj0, fk0 = 1-fi1, 1-fj1, 1-fk1
        for i in range(ars.shape[0]):
            out[l, i] = (
                fi0 * fj0 * fk0 * ars[i, i0, j0, k0] +
                fi0 * fj0 * fk1 * ars[i, i0, j0, k1] +
                fi0 * fj1 * fk0 * ars[i, i0, j1, k0] +
                fi0 * fj1 * fk1 * ars[i, i0, j1, k1] +
                fi1 * fj0 * fk0 * ars[i, i1, j0, k0] +
                fi1 * fj0 * fk1 * ars[i, i1, j0, k1] +
                fi1 * fj1 * fk0 * ars[i, i1, j1, k0] +
                fi1 * fj1 * fk1 * ars[i, i1, j1, k1]
                )
    return out.T

Timings
out_1 = mymap(fgh, coords)
out_2 = mymap_opt(fgh, coords)
print(np.allclose(out_1,out_2))
#True

%timeit out = mymap(fgh, coords)
#1.09 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit out = mymap_opt(fgh, coords)
#parallel=True
#144 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#parallel=False
#259 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
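The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent could look like the sketch below. The helper name `best_time` is hypothetical, and `np.sort` merely stands in for the function being measured. The warm-up call matters for Numba functions, because the first call includes JIT compilation.

```python
import time
import numpy as np

def best_time(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    func(*args)  # warm-up call; for a Numba function this triggers compilation
    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return min(times)

a = np.random.rand(100_000)
t = best_time(np.sort, a)   # stand-in workload for illustration
print(f"{t * 1e3:.3f} ms")
```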


There is a performance-critical "bug" in your code. Every vectorized command is a for loop of its own (8 single inner loops). By writing out the inner loop explicitly, all 8 loops can simply be joined into one: for i in range(ars.shape[0]): out[l, i] = ... This gives a speedup of about 3x. You can then also parallelize the outer loop (parallel=True and for l in nb.prange(n):), which gives another factor of 2. Avoiding the other vectorized commands entirely may give an additional speedup, even though they look convenient in this example.
@max9111 I don't see how adding an explicit inner loop would speed things up. Could you show more detailed code for this part?
@max9111 I had expected the JIT compiler to figure out how to handle ars[:, i0, j0, k0] with SIMD instructions without allocating new arrays, but it seems I was wrong. Thanks.
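The difference between the two formulations in the comment can be seen in a reduced 1-D sketch with two terms instead of eight. It runs in plain Python without Numba, so it only shows that both forms compute the same values and says nothing about speed. The function names and test data are assumptions.

```python
import numpy as np

def interp_vectorized(ars, i0, w1):
    # Slice form: each ars[:, ...] term is evaluated over all m arrays
    return (1.0 - w1) * ars[:, i0] + w1 * ars[:, i0 + 1]

def interp_fused(ars, i0, w1):
    # Explicit form: one loop over the m arrays, all terms combined per element
    m = ars.shape[0]
    out = np.empty(m)
    w0 = 1.0 - w1
    for i in range(m):
        out[i] = w0 * ars[i, i0] + w1 * ars[i, i0 + 1]
    return out

rng = np.random.default_rng(0)
ars = rng.standard_normal((3, 10))   # m = 3 arrays of length 10
a = interp_vectorized(ars, 4, 0.25)
b = interp_fused(ars, 4, 0.25)
print(np.allclose(a, b))
```

In the answer above, the fused form is the one used inside `mymap_opt`.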