Python: fastest way to pack a 2D array of binary values into a UINT64 array

python numpy

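I have a 2D uint8 NumPy array of shape (149797, 64). Each element is either 0 or 1. I want to pack the binary values in each row into a single uint64 value, so that I end up with a uint64 array of shape (149797,). I tried the following code using NumPy's packbits function: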

import numpy as np

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)
col_pack = np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
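To make the target layout concrete, here is a minimal sketch (not part of the original question) that checks the packbits expression against a plain-Python reference on a tiny hand-built array. The helper name `pack_row_reference` is hypothetical. It treats the first column as the most significant bit, which is also what the shift loop later in the question does. The `view(np.uint64)` step assumes a little-endian machine.

```python
import numpy as np

def pack_row_reference(row):
    """Pack one row of 0/1 values into a Python int (first column = most significant bit)."""
    value = 0
    for bit in row:
        value = (value << 1) | int(bit)
    return value

# Tiny example: 3 rows of 64 bits with easy-to-check values.
bits = np.zeros((3, 64), dtype=np.uint8)
bits[0, 63] = 1      # only the last column set  -> 1
bits[1, 0] = 1       # only the first column set -> 2**63
bits[2, 56:] = 1     # last eight columns set    -> 255

expected = [pack_row_reference(r) for r in bits]

# The expression from the question (assumes a little-endian machine).
packed = np.packbits(bits.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)

print(expected)
print(packed.tolist())
```

Both printed lists should agree. If they do not, the host is probably big-endian and the byte reversal is not needed there.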

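The packbits call takes about 10 ms to execute. Reshaping the array alone seems to take around 7 ms, and I saw no speed improvement.
Finally, I also tried compiling it for the CPU with numba: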

import numpy as np
from numba import njit

@njit
def shifting(bitlist):
    x = np.zeros(149797, dtype=np.uint64)
    rows, cols = bitlist.shape
    for i in range(0, rows):
        out = 0
        for bit in range(0, cols):
            out = (out << 1) | bitlist[i][bit]  # if I comment out bitlist, time = 190 microseconds
        x[i] = np.uint64(out)  # time drops to microseconds if this line is commented out
    return x
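The execution time (Google Colab, dual core, 2.2 GHz) is slightly better, at 3.24 ms. At the moment, the Python solution using the byteswap method (Paul's) seems to be the best one, at 1.74 ms.
How can we speed up this conversion further? Is there any scope for using vectorization (or parallelization), bitarrays, etc. to achieve a speedup?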
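For reference, on a 12-core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz):
Paul's method: 1595.0 microseconds (I suppose it does not use multiple cores)
Numba code: 146.0 microseconds (the parallel Numba version mentioned in the answers)

i.e. a speedup of about 10x.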
You can get a sizeable speedup by using byteswap instead of reshape etc.:

import numpy as np
from timeit import timeit

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703, 12753669247696755125],
#       dtype=uint64)

np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703, 12753669247696755125],
#       dtype=uint64)

timeit(lambda: np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64), number=100)
# 1.1054180909413844

timeit(lambda: np.packbits(test).view(np.uint64).byteswap(), number=100)
# 0.18370431219227612
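Why the two expressions agree: np.packbits writes the bytes of each row most-significant-first (bit 0 of the row lands in the top bit of the first byte), which is the byte layout of a big-endian 64-bit integer. On a little-endian machine, view(np.uint64) reads those bytes in the opposite order, and byteswap() corrects that. The sketch below is an illustration, not part of the original answer. It reads the bytes with an explicit big-endian dtype (">u8") instead, which should give the same values regardless of host byte order. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
test = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)

packed_bytes = np.packbits(test)   # 8 bytes per row, row bit 0 first

# Interpret each group of 8 bytes as a big-endian unsigned 64-bit integer.
big_endian = packed_bytes.view(">u8")

# Convert to the native uint64 dtype if downstream code expects it (this copies).
native = big_endian.astype(np.uint64)

# Cross-check the first row against Python's own big-endian conversion.
row0 = int.from_bytes(packed_bytes[:8].tobytes(), byteorder="big")
print(int(native[0]) == row0)
```

The view itself does not copy data. Only the final astype does, so whether this is faster or slower than byteswap() on a given machine would need to be measured.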
A Numba solution (version 0.46/Windows)
Code

import numpy as np
import numba as nb

# with memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1] == 64
    x = np.empty(bitlist.shape[0], dtype=np.uint64)
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

# without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist, x):
    assert bitlist.shape[1] == 64
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

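You can cut about a third of the runtime by not using chained indexing on bitlist. Replace bitlist[i][bit] with bitlist[i, bit]. On every iteration of the loop you are creating an intermediate array for the row bitlist[i]. When you do that rows*cols times, it adds up!

You could try parallelizing it (@njit(parallel=True) and nb.prange instead of range in the outer loop). There is also no need to initialize with zeros (use np.empty). Do you have many more arrays to convert?

@user3483203 In plain Python it improves the time from 6.2 s to 5.4 s, but in the numba-jitted version it does not change much. Also, the packbits version already gives 10 ms.

The pure Python loop is in the range of seconds, packbits (10 ms) and numba (so far) are in the 6 ms range. I am expecting < 1 ms.

It gives 1.74 ms. That is the best so far. I actually wanted to use the Python version with numba jit to take advantage of vectorization, parallelization, etc., but unfortunately numba does not fully support the packbits function. Thanks...

I guess it depends on your hardware (number of cores, memory type, etc.), library versions, and so on. I tried your exact code on Google Colab with a dual-core CPU (2.3 GHz, 2 threads per core). The results: 4.06 ms for shifting, 4 ms for shifting_2, 2.22 ms for packbits with byteswap, and 12.8 ms for packbits with reshape. Packbits with byteswap seems to be the fastest, and memory allocation and empty (instead of zeros) do not seem to make any difference here. What do you think? Is this due to differences in hardware and library versions?

@anilsathyan7 Are you using a special build of numpy, or a standard Linux build? Depending on how numpy was compiled (compiler and compiler flags), there can be quite large differences. On Windows (Anaconda) on the same hardware, the timings can differ considerably. I have measured large differences in other examples (the standard Linux builds were always faster).

E.g. 100 loops, best of 3: 4.1 ms per loop; 100 loops, best of 3: 4 ms per loop; 100 loops, best of 3: 2.06 ms per loop; 100 loops, best of 3: 12.7 ms per loop... As I said, I used the "default" numpy, numba, etc. in Colab. It should be highly optimized.

With the numba parallel approach I got a 10x speedup on a high-end machine compared to the Python packbits approach! I wonder whether there is a parallel way, using all cores, to write the same code (packbits) using only Python? That would be a fair comparison...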
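On the last question in the comments, one way to try spreading the packbits approach over several cores without Numba is to split the rows into chunks and pack each chunk in a worker thread. This is only a sketch under assumptions: the function name `pack_rows_threaded` and the `n_workers` parameter are made up for illustration, and whether the threads actually run in parallel depends on whether your NumPy build releases the GIL inside packbits, which is not verified here. Measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def pack_rows_threaded(bits, n_workers=4):
    """Pack each 64-column row of 0/1 values into one uint64, chunked over threads.

    Hypothetical helper: the first column is treated as the most significant bit.
    """
    assert bits.ndim == 2 and bits.shape[1] == 64
    out = np.empty(bits.shape[0], dtype=np.uint64)
    # Row boundaries of the chunks, one chunk per worker.
    bounds = np.linspace(0, bits.shape[0], n_workers + 1).astype(np.intp)

    def work(k):
        lo, hi = bounds[k], bounds[k + 1]
        # packbits yields big-endian bytes per row; ">u8" reads them as such.
        out[lo:hi] = np.packbits(bits[lo:hi]).view(">u8")

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, range(n_workers)))  # list() surfaces worker exceptions
    return out

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)
result = pack_rows_threaded(test)
```

Each worker writes to a disjoint slice of the preallocated output, so no locking is needed. For an array this small, thread start-up overhead may well outweigh any gain.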
Timings

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

# If you call this function multiple times, allocating memory only once may be enough
x = np.empty(test.shape[0], dtype=np.uint64)

# Warmup: the first call takes significantly longer
res = shifting(test)
res = shifting_2(test, x)

%timeit res = shifting(test)
# 976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit res = shifting_2(test, x)
# 764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.packbits(test).view(np.uint64).byteswap()
# 8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# 17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
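The %timeit lines are IPython magics. For a plain Python script, a rough equivalent can be sketched with the standard timeit module, as below. The helper name `best_time_ms` is hypothetical, and only the two NumPy variants are timed so that the snippet has no Numba dependency. Absolute numbers will differ by machine and library build, as the comments discuss.

```python
import timeit

import numpy as np

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)

def pack_reshape(a):
    # Reverse the eight byte groups of each row, then pack.
    return np.packbits(a.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)

def pack_byteswap(a):
    # Pack first, then swap the byte order of each 64-bit word.
    return np.packbits(a).view(np.uint64).byteswap()

def best_time_ms(func, arg, repeat=5, number=10):
    """Best average time per call in milliseconds over `repeat` runs."""
    func(arg)  # warm-up call, so one-off costs are not measured
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number * 1e3

for f in (pack_reshape, pack_byteswap):
    print(f"{f.__name__}: {best_time_ms(f, test):.2f} ms per call")
```

A Numba function could be passed to `best_time_ms` in the same way. The warm-up call then also keeps the JIT compilation time out of the measurement.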