What is wrong with my Julia loop/devectorized code?


I am using Julia 1.0. Please consider the following code:

using LinearAlgebra
using Distributions

## create random data
const data = rand(Uniform(-1,2), 100000, 2)

function test_function_1(data)
    theta = [1 2]
    coefs = theta * data[:,1:2]'
    res   = coefs' .* data[:,1:2]
    return sum(res, dims = 1)'
end

function test_function_2(data)
    theta   = [1 2]
    sum_all = zeros(2)
    for i = 1:size(data)[1]
        sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]
    end
    return sum_all
end

After running each function once (so that compilation is not included), I timed them:

julia> @time test_function_1(data)
  0.006292 seconds (16 allocations: 5.341 MiB)
2×1 Adjoint{Float64,Array{Float64,2}}:
 150958.47189289227
 225224.0374366073

julia> @time test_function_2(data)
  0.038112 seconds (500.00 k allocations: 45.777 MiB, 15.61% gc time)
2-element Array{Float64,1}:
 150958.4718928927
 225224.03743660534

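`test_function_1` is significantly better in terms of both allocations and speed, even though `test_function_1` is not devectorized. I would expect `test_function_2` to perform better. Note that the two functions compute the same thing.

I have a hunch that this is because in `test_function_2` I use `sum_all .= sum_all + ...`, but I am not sure that is the problem. Can I get a hint?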

First, let me show how I would write your function if I wanted to use a loop (note that `eachrow` requires Julia 1.1 or later):

function test_function_3(data)
    theta   = (1, 2)
    sum_all = zeros(2)
    for row in eachrow(data)
        sum_all .+= dot(theta, row) .*  row
    end
    return sum_all
end

Here is a benchmark comparison of the three options (using `@benchmark` from BenchmarkTools):

julia> @benchmark test_function_1($data)
BenchmarkTools.Trial: 
  memory estimate:  5.34 MiB
  allocs estimate:  16
  --------------
  minimum time:     1.953 ms (0.00% GC)
  median time:      1.986 ms (0.00% GC)
  mean time:        2.122 ms (2.29% GC)
  maximum time:     4.347 ms (8.00% GC)
  --------------
  samples:          2356
  evals/sample:     1

julia> @benchmark test_function_2($data)
BenchmarkTools.Trial: 
  memory estimate:  45.78 MiB
  allocs estimate:  500002
  --------------
  minimum time:     16.316 ms (7.44% GC)
  median time:      16.597 ms (7.63% GC)
  mean time:        16.845 ms (8.01% GC)
  maximum time:     34.050 ms (4.45% GC)
  --------------
  samples:          297
  evals/sample:     1

julia> @benchmark test_function_3($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     777.204 μs (0.00% GC)
  median time:      791.458 μs (0.00% GC)
  mean time:        799.505 μs (0.00% GC)
  maximum time:     1.262 ms (0.00% GC)
  --------------
  samples:          6253
  evals/sample:     1

Next, you can speed things up further by implementing `dot` explicitly inside the loop:

julia> function test_function_4(data)
           theta   = (1, 2)
           sum_all = zeros(2)
           for row in eachrow(data)
               @inbounds sum_all .+= (theta[1]*row[1]+theta[2]*row[2]) .*  row
           end
           return sum_all
       end
test_function_4 (generic function with 1 method)

julia> @benchmark test_function_4($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     502.367 μs (0.00% GC)
  median time:      502.547 μs (0.00% GC)
  mean time:        505.446 μs (0.00% GC)
  maximum time:     806.631 μs (0.00% GC)
  --------------
  samples:          9888
  evals/sample:     1

To understand these differences, let us look at this line of your code:

sum_all .= sum_all + (theta * data[i,1:2])[1] *  data[i,1:2]

Let us count the memory allocations you perform in this expression:

sum_all .= 
    sum_all
    + # allocation of a new vector as a result of addition
    (theta
     *  # allocation of a new vector as a result of multiplication
     data[i,1:2] # allocation of a new vector via getindex
    )[1]
    * # allocation of a new vector as a result of multiplication
    data[i,1:2] # allocation of a new vector via getindex

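So you can see that you allocate five times in each iteration of the loop. Allocations are expensive. You can see this in the benchmark, where you have 500002 allocations:

- 1 allocation for `sum_all`
- 1 allocation for `theta`
- 500000 allocations in the loop (5 * 100000)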

In addition, indexing such as `data[i,1:2]` performs bounds checking, which is also a small cost (but negligible compared to the allocations).
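
To make the allocation count above concrete, here is a minimal sketch that compares range indexing with views. The helper names `rowsum_slices` and `rowsum_views` and the small matrix `x` are illustrative assumptions, not part of the original code, and the exact byte counts depend on the Julia version (older versions may allocate for views as well).

```julia
# Hypothetical helpers for illustration; `x` stands in for `data`.
function rowsum_slices(x)
    s = 0.0
    for i in axes(x, 1)
        s += sum(x[i, 1:2])        # range indexing copies the row into a new Vector
    end
    return s
end

function rowsum_views(x)
    s = 0.0
    for i in axes(x, 1)
        s += sum(@view x[i, 1:2])  # a view refers to the parent array without copying
    end
    return s
end

x = rand(1_000, 2)
@assert rowsum_slices(x) ≈ rowsum_views(x)   # warm-up calls; both compute the same sum

# Measure after the warm-up so compilation is not counted.
bytes_slices = @allocated rowsum_slices(x)
bytes_views  = @allocated rowsum_views(x)
println("slices: ", bytes_slices, " bytes, views: ", bytes_views, " bytes")
```

The slicing version should report at least one allocation per row, which is the same effect that produces the 500000 loop allocations in `test_function_2`.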

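Now, in `test_function_3` I use `eachrow(data)`. This also gives me the rows of the `data` matrix, but they are returned as views (not newly allocated copies), so no allocation happens inside the loop. Next, I use the `dot` function to avoid the allocation previously caused by the matrix multiplication (I changed `theta` from a `Matrix` to a `Tuple` because `dot` is then a bit faster, but this is secondary). Finally, I write `sum_all .+= dot(theta, row) .* row`; here all operations are broadcasted, so Julia can perform broadcast fusion (again, no allocations happen).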
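
The effect of broadcast fusion can be seen in isolation with a small sketch. The helper names `update_unfused!` and `update_fused!` are hypothetical and chosen only for this illustration; the exact byte counts depend on the Julia version.

```julia
# Hypothetical helper names for illustration; they are not part of the original answer.
function update_unfused!(acc, a, x)
    acc .= acc + a * x    # `a * x` and `acc + ...` each build a temporary Vector
    return acc
end

function update_fused!(acc, a, x)
    acc .+= a .* x        # every operation is dotted, so this fuses into a single loop
    return acc
end

x = [1.0, 2.0]
acc_unfused = update_unfused!(zeros(2), 3.0, x)
acc_fused   = update_fused!(zeros(2), 3.0, x)
@assert acc_unfused == acc_fused == [3.0, 6.0]

# Measure after the warm-up calls above so compilation is not counted.
acc = zeros(2)
bytes_unfused = @allocated update_unfused!(acc, 3.0, x)
bytes_fused   = @allocated update_fused!(acc, 3.0, x)
println("unfused: ", bytes_unfused, " bytes, fused: ", bytes_fused, " bytes")
```

Both versions write into `acc` in place; only the unfused one has to materialize the right-hand side before assigning it.
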
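
In `test_function_4` I simply replace `dot` with an unrolled loop, since we know the dot product is computed over exactly two elements. In fact, if you unroll everything completely and use `@simd`, it gets even faster: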

julia> function test_function_5(data)
           theta = (1, 2)
           s1 = 0.0
           s2 = 0.0
           @inbounds @simd for i in axes(data, 1)
               r1 = data[i, 1]
               r2 = data[i, 2]
               mul = theta[1]*r1 + theta[2]*r2
               s1 += mul * r1
               s2 += mul * r2
           end
           return [s1, s2]
       end
test_function_5 (generic function with 1 method)

julia> @benchmark test_function_5($data)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  1
  --------------
  minimum time:     22.721 μs (0.00% GC)
  median time:      23.146 μs (0.00% GC)
  mean time:        24.306 μs (0.00% GC)
  maximum time:     100.109 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

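So you can see that this way you are nearly 100 times faster than with `test_function_1` (about 86x by minimum time: 1.953 ms versus 22.721 μs). Still, `test_function_3` is reasonably fast and fully generic, so normally I would write something like `test_function_3` unless I really needed maximum speed and knew that the dimensions of my data were fixed and small.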
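
When rewriting a function this aggressively, it is worth checking that it still computes the same quantity. All the variants above compute the sum over rows x_i of (theta · x_i) * x_i, which in matrix form is X' * (X * theta). The sketch below is an assumption-laden sanity check: `unrolled_sum` is a hypothetical name that repeats the logic of `test_function_5`, and `x` is a small random stand-in for `data`.

```julia
# Hypothetical name; same logic as test_function_5, repeated so the sketch is self-contained.
function unrolled_sum(x)
    theta = (1, 2)
    s1 = 0.0
    s2 = 0.0
    @inbounds @simd for i in axes(x, 1)
        r1 = x[i, 1]
        r2 = x[i, 2]
        mul = theta[1] * r1 + theta[2] * r2   # dot(theta, row), written out by hand
        s1 += mul * r1
        s2 += mul * r2
    end
    return [s1, s2]
end

x = rand(1_000, 2)
reference = x' * (x * [1.0, 2.0])   # matrix form of the same sum

# Compare with ≈ rather than ==, because the summation order differs.
@assert unrolled_sum(x) ≈ reference
```

The comparison uses `≈` because reordering floating-point additions changes the last few digits, which is also why the two results in the question differ slightly.
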
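
Can you explain why your function is better? In `test_function_5` you use indexing instead of `eachrow(data)`, as in `r1 = data[i, 1]`. Doesn't that make you allocate?

Ah, that is a good question. The difference is that `data[i, 1]` fetches a single cell from the `Matrix` (a `Float64` value), so it does not allocate. If we wrote, for example, `data[i, 1:1]`, we would get a 1-element `Vector` containing that single cell, and that would allocate.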
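
The distinction is easy to check in the REPL. This is a minimal sketch using a small hypothetical matrix `m` in place of `data`:

```julia
m = [1.0 2.0; 3.0 4.0]   # small stand-in for `data`

cell  = m[2, 1]      # scalar indices return the element itself
slice = m[2, 1:1]    # a range index returns a new 1-element Vector

@assert cell isa Float64
@assert slice isa Vector{Float64}
@assert slice[1] == cell
```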