Bit-packing performance in Julia vs. Python

python julia

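An earlier question received a very good answer with very efficient code, which I could only match with C code. That prompted me to try to at least match the Python code in Julia.

Below is the Python code. It takes 0.64 seconds to pack the bits into unsigned integers, while the C code takes 0.27 seconds.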

import numpy as np
import numba as nb
import time
colm = int(200000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

start_time = time.time()

compute(A, AU)

end_time = time.time()
print(end_time - start_time)
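
One way to sanity-check a kernel like compute is to compare its output with NumPy's built-in np.packbits on a small array. The sketch below is an illustration rather than part of the original question. It uses a plain-Python reference loop (pack_rows_reference, a hypothetical helper) in place of the Numba kernel so that it runs without Numba, and it relies on np.packbits packing the most significant bit first by default, which matches the shift pattern above.

```python
import numpy as np

def pack_rows_reference(bits):
    """Pack each run of 8 bits along axis 1 into one uint8, MSB first."""
    rows, cols = bits.shape
    assert cols % 8 == 0, "number of columns must be a multiple of 8"
    out = np.empty((rows, cols // 8), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols // 8):
            res = 0
            for k in range(8):
                # Same layout as the kernel: element offset+0 lands in bit 7.
                res |= int(bits[i, j * 8 + k]) << (7 - k)
            out[i, j] = res
    return out

rng = np.random.default_rng(0)
small = rng.integers(0, 2, size=(4, 32), dtype=np.int8)

packed_loop = pack_rows_reference(small)
packed_numpy = np.packbits(small.astype(np.uint8), axis=1)

# True when the hand-written layout agrees with np.packbits.
print(np.array_equal(packed_loop, packed_numpy))
```

The same comparison could be run against the output array A of the Numba kernel on a reduced problem size, assuming Numba is installed.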

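The different approaches I tried in Julia take 3 to 6 seconds. I am not sure how to improve the performance to at least match Python/Numba.

For a single-threaded conversion to a BitArray, AU .% Bool (originally Bool.(AU); see the edit note below) should be efficient: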

using Random
using BenchmarkTools
AU = rand(Bool, 10_000, 200_000)
@benchmark Bool.($AU)

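Edit: I just realized that Bool.(AU) is not a good fit for you, because you are converting from an array of 8-bit integers rather than from an array of Bool, so Bool.(AU) has to check whether each element of AU is 0 or 1. Use AU .% Bool instead, which takes only the lowest bit of each integer and should have the performance shown above.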
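
The distinction in the edit note can be sketched in Python. This is only an analogy to the Julia behaviour described above, and both helper names are hypothetical: to_bool_checked rejects anything other than 0 or 1, as a checked conversion has to, while to_bool_lowbit masks the lowest bit and never inspects the rest of the value.

```python
def to_bool_checked(values):
    """Convert integers to booleans, rejecting anything other than 0 or 1."""
    out = []
    for v in values:
        if v == 0:
            out.append(False)
        elif v == 1:
            out.append(True)
        else:
            raise ValueError(f"cannot convert {v} to a boolean exactly")
    return out

def to_bool_lowbit(values):
    """Keep only the least significant bit of each integer; no validation."""
    return [bool(v & 1) for v in values]

bits = [0, 1, 1, 0]
# For pure 0/1 input both conversions agree.
print(to_bool_checked(bits) == to_bool_lowbit(bits))  # True

# For other input they differ: the low-bit version maps 2 -> False and
# 3 -> True without complaint, while the checked version would raise.
print(to_bool_lowbit([2, 3]))  # [False, True]
```

The per-element validation is the extra work the edit note attributes to the checked conversion; when the input is already known to contain only 0 and 1, skipping it does not change the result.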

If you want to compare their performance, why not try to make the Julia code look like the Python code? For example:

rowm = 200000 ÷ 8
cols = 10000
rows = rowm * 8

AU = rand(Int8.(0:1), rows, cols)
A = zeros(UInt8, rowm, cols)

function compute!(A, AU)
    for i in 1:size(A, 2)
        Threads.@threads for j in 1:size(A, 1)
            offset = (j-1) * 8 + 1
            res =  AU[offset  , i] << 7
            res |= AU[offset+1, i] << 6
            res |= AU[offset+2, i] << 5
            res |= AU[offset+3, i] << 4
            res |= AU[offset+4, i] << 3
            res |= AU[offset+5, i] << 2
            res |= AU[offset+6, i] << 1
            res |= AU[offset+7, i]
            A[j, i] = res % UInt8
        end
    end
end

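Quick note: slicing in Julia creates a copy, so every time you do a[:, ...] you are allocating a brand-new array. Use views instead.

The numba code is highly optimized and parallel. I am surprised you could match its performance with simple C code. You need to do something similar to nb.prange in Julia to get close to it.

Yes, views help. compute4 without views: 3.545348 seconds (400.00 k allocations: 3.791 GiB, 16.12% gc time); with views: 1.850615 seconds (200.00 k allocations: 1.895 GiB, 14.13% gc time). Note the reduction in the number of allocations.

For me, compute2 without @views takes 2.898168 seconds (1.98 M allocations: 1.985 GiB, 5.86% gc time, 10.84% compilation time), and with @views it takes 0.363232 seconds (667.67 k allocations: 32.487 MiB, 4.75% gc time, 50.73% compilation time). Also, it is better to benchmark Julia code with @btime and to interpolate global variables: julia> @btime compute2($B, $AU); reports 172.111 ms (0 allocations: 0 bytes).

@Megalng Now, just by using the @views macro in the compute2 function, it can match or even improve on that. Using A = Bool.(A2) gives similar performance. See @VincentYu's answer.

Rather than imitating, I prefer to find out how to beat the best that Python can come up with.

I tried your code with colm = 200000/8 and rows = 30000. The total run time was 48.0 seconds, versus 11.7 seconds with the compute2() function modified with the @views macro. Even though I specified julia --threads 4, multithreading does not seem to work for some reason. I only see one CPU running at 100%.

Move Threads.@threads to the outer loop (over i), and add @inbounds. That gives a 20x speedup on my computer, running in 100 ms for 20000x10000. For a 20000x30000 matrix, it runs in 330 ms. By the way, creating AU is now a big strain on my patience, but sampling from a tuple instead of from a range is much faster. So: rand(UInt8.((0, 1)), rows, cols)

@DNF I have moved @threads to the outer loop and it had no effect. Where should I put the @inbounds macro?

Bool.(AU) is indeed slow. AU .% Bool performs well, but it does not beat compute2() with the @views macro.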
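
The copy-versus-view point raised in the comments is worth contrasting with the Python side of this comparison. In NumPy, basic slicing such as a[:, j] returns a view, whereas the comment above notes that Julia slicing copies unless a view is requested. A minimal NumPy sketch of the distinction follows; the array names are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.zeros((4, 8), dtype=np.uint8)

col_view = a[:, 2]          # basic slicing: a view onto a's memory
col_copy = a[:, 2].copy()   # explicit copy: a separate allocation

col_view[:] = 1             # writes through to a
col_copy[:] = 7             # leaves a untouched

print(np.shares_memory(a, col_view))  # True
print(np.shares_memory(a, col_copy))  # False
print(a[:, 2])                        # [1 1 1 1]
```

A loop that takes a copying slice on every iteration pays for one allocation per iteration, which is consistent with the allocation counts quoted in the comments dropping once views are used.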
Benchmark output for the @benchmark Bool.($AU) example in the answer above:

BenchmarkTools.Trial: 
  memory estimate:  238.42 MiB
  allocs estimate:  4
  --------------
  minimum time:     658.897 ms (0.00% GC)
  median time:      672.948 ms (0.00% GC)
  mean time:        676.287 ms (0.86% GC)
  maximum time:     710.870 ms (6.57% GC)
  --------------
  samples:          8
  evals/sample:     1
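
The BenchmarkTools report above summarises several samples rather than a single run. A comparable measurement on the Python side could be sketched with the standard-library timeit module, as below. This is an assumption about how one might time the packing step, not what the question did (it measured a single time.time() interval), and it uses np.packbits on a small array as a stand-in workload so that it runs without Numba.

```python
import statistics
import timeit

import numpy as np

rng = np.random.default_rng(0)
AU = rng.integers(0, 2, size=(100, 8_000), dtype=np.uint8)

def pack(bits):
    # Stand-in workload: pack 8 bits per byte along each row, MSB first.
    return np.packbits(bits, axis=1)

pack(AU)  # warm-up call so one-off setup costs are not measured

# 8 samples of 1 evaluation each, mirroring "samples: 8, evals/sample: 1".
samples = timeit.repeat(lambda: pack(AU), repeat=8, number=1)

print(f"minimum time: {min(samples) * 1e3:.3f} ms")
print(f"median time:  {statistics.median(samples) * 1e3:.3f} ms")
print(f"mean time:    {statistics.mean(samples) * 1e3:.3f} ms")
print(f"maximum time: {max(samples) * 1e3:.3f} ms")
```

Reporting the minimum alongside the median makes it easier to separate the cost of the kernel from noise such as garbage collection or other processes competing for the CPU.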
