Julia: computing the sum of f(i) * x(i) * x(i)' fast


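I am trying to compute the sum over i of f(i) * x(i) * x(i)', where x(i) is a column vector, x(i)' is its transpose, and f(i) is a scalar. In other words, it is a weighted sum of outer products.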
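For reference, the quantity can be written in matrix form, which is the identity that the `bsxfun` and `broadcast` solutions in this thread rely on. Here X is assumed to be the N-by-d matrix whose i-th row is x(i)' (this matches the layout of `x` in the code in this thread), and diag(f) is the diagonal matrix built from the weights:

```latex
H \;=\; \sum_{i=1}^{N} f_i \, x_i x_i^{\top}
  \;=\; X^{\top} \operatorname{diag}(f) \, X ,
\qquad
X = \begin{bmatrix} x_1^{\top} \\ \vdots \\ x_N^{\top} \end{bmatrix} \in \mathbb{R}^{N \times d}.
```

Scaling row i of X by f_i gives diag(f) X, so H is a single d-by-d matrix product once the rows have been scaled.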

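In MATLAB, this can be done very fast with `bsxfun`. My MATLAB code runs in 260 ms on my laptop (MacBook Air 2010).

I have been trying to get Julia to do the same job, but I cannot make it as fast.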

N = int(1e5);
d = 100;
f = randn(N);
x = randn(N, d);

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function hess2(x, f)
    N, d = size(x);
    H2 = zeros(d,d);
    @simd for k = 1:N
        @inbounds H2 += f[k] * x[k, :]' * x[k, :];
    end
    return H2
end

function hess3(x, f)
    N, d = size(x);
    H3 = zeros(d,d);
    for k = 1:N
        for k1 = 1:d
            @simd for k2 = 1:d
                @inbounds H3[k1, k2] += x[k, k1] * x[k, k2] * f[k];
            end
        end
    end
    return H3
end

The results are:

@time H1 = hess1(x, f);
@time H2 = hess2(x, f);
@time H3 = hess3(x, f);
elapsed time: 0.776116469 seconds (262480224 bytes allocated, 26.49% gc time)
elapsed time: 30.496472345 seconds (16385442496 bytes allocated, 56.07% gc time)
elapsed time: 2.769934563 seconds (80128 bytes allocated)

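`hess1` is similar to MATLAB's `bsxfun` approach but slower. `hess3` uses no temporary memory but is significantly slower. My best Julia code is 3 times slower than MATLAB.

How can I make this code faster?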

IJulia gist:

Julia version: 0.3.0-rc1

EDIT: I tested on a more powerful computer (3.5 GHz Intel i7, 4 cores, L2 256 KB, L3 8 MB):

- MATLAB R2014a without `-singleCompThread`: 0.053 s
- MATLAB R2014a with `-singleCompThread`: 0.080 s (@tholy's suggestion)
- Julia 0.3.0-rc1
  - `hess1`: elapsed time: 0.215406904 seconds (262498648 bytes allocated, 32.74% gc time)
  - `hess2`: elapsed time: 10.722578699 seconds (16384080176 bytes allocated, 62.20% gc time)
  - `hess3`: elapsed time: 1.065504355 seconds (80176 bytes allocated)
  - `bsxfunstyle`: elapsed time: 0.063540168 seconds (80081072 bytes allocated, 25.04% gc time) (@iaindanning's solution)

Indeed, using `broadcast` is much faster, and comparable to MATLAB's `bsxfun`.

You are looking for the `broadcast` function.

I implemented your version as well as a `broadcast` version, and here is what I found:

srand(1988)
N = 100_000
d = 100
f = randn(N, 1)
x = randn(N, d)

function hess1(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @simd for kk = 1:N
        @inbounds temp[kk, :] = f[kk] * x[kk, :];
    end
    H = x' * temp;
end

function bsxfunstyle(x, f)
    x' * broadcast(*,f,x)
end

# Warmup
hess1(x,f)
bsxfunstyle(x, f)

# For real
println("Hess1")
@time H1 = hess1(x, f)
println("Broadcast")
@time H2 = bsxfunstyle(x, f)

# Check solutions are identical
println(sum(abs(H1-H2)))

This outputs:

Hess1
elapsed time: 0.324256216 seconds (262498648 bytes allocated, 33.95% gc time)
Broadcast
elapsed time: 0.126647594 seconds (80080696 bytes allocated, 20.22% gc time)
0.0
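For readers on a current Julia release, the same idea can be sketched with dot-broadcast syntax. This is a minimal sketch, assuming Julia 1.x: `f .* x` plays the role of `broadcast(*, f, x)`, `Random.seed!` replaces `srand`, and `x'` is a lazy adjoint, so no transposed copy is made. The name `hess_broadcast` and the small problem size are illustrative choices, not part of the original answer, and no timings are claimed.

```julia
using LinearAlgebra, Random

# H = sum_i f[i] * x[i, :] * x[i, :]' written as one matrix product.
# `f .* x` scales row i of x by f[i] (f is a length-N vector).
hess_broadcast(x::AbstractMatrix, f::AbstractVector) = x' * (f .* x)

Random.seed!(1988)       # Julia 1.x replacement for srand(1988)
N, d = 1_000, 10         # smaller than in the thread, for a quick check
f = randn(N)
x = randn(N, d)

H = hess_broadcast(x, f)

# Reference: the explicit weighted sum of outer products.
H_ref = zeros(d, d)
for i in 1:N
    xi = x[i, :]                     # row i as a length-d vector
    H_ref .+= f[i] .* (xi * xi')     # add the weighted outer product in place
end

println(isapprox(H, H_ref))
```

The reference loop is only there to check the result. It allocates a temporary per row, which is the pattern the broadcast version avoids.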

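Your function has several performance issues:

- You are creating temporary arrays with `x[kk, :]`.
- You are traversing the matrix row by row, while matrices are stored in column-major order.
- You are using `x'` (which transposes the matrix first) instead of `At_mul_B(x, ...)`.

A simple modification gives better performance: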

N = 100_000
d = 100
f = randn(N)
x = randn(N, d)

function hess(x, f)
    N, d = size(x);
    temp = zeros(N, d);
    @inbounds for k1 = 1:d
        @simd for kk = 1:N
           temp[kk, k1] = f[kk] * x[kk, k1]
        end
    end
    H = At_mul_B(x, temp)
end
@time hess(x, f)
# 0.067636 seconds (9 allocations: 76.371 MB, 11.24% gc time)
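The column-major point also applies to the temporary-free `hess3` from the question. The following is a minimal sketch, assuming Julia 1.x and `Float64` inputs, of the same triple loop with the order changed so that the innermost loop runs down columns of `x`. The name `hess_colmajor` is hypothetical, and the sketch has not been benchmarked against the versions in this thread.

```julia
# Same arithmetic as hess3, but the innermost loop index k walks down
# columns k1 and k2 of x, which are contiguous in column-major storage.
function hess_colmajor(x::AbstractMatrix, f::AbstractVector)
    N, d = size(x)
    H = zeros(d, d)
    @inbounds for k2 in 1:d, k1 in 1:d
        s = 0.0                      # accumulator, assuming Float64 data
        @simd for k in 1:N
            s += f[k] * x[k, k1] * x[k, k2]
        end
        H[k1, k2] = s
    end
    return H
end

# Small usage example; compare against the matrix-product form.
x_small = [1.0 2.0; 3.0 4.0; 5.0 6.0]
f_small = [1.0, -1.0, 2.0]
H_small = hess_colmajor(x_small, f_small)
println(isapprox(H_small, x_small' * (f_small .* x_small)))
```

The only allocation is the d-by-d result, at the cost of giving up the BLAS matrix product that the temporary-based versions use.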

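Comments:

This question might be better asked on Code Review.

@JohnWHSmith Thx for the suggestion, but there don't seem to be many Julia users on codereview.

Hey @Memming, did my answer help you?

@iaindenning It definitely helped. Thx :)

Yes, I did warm it up! Is there a specific way to do it? (I just ran the function several times before measuring the time.)

Yes, just run it once before measuring the time... Strange. It is much faster on my machine... I will edit my answer with my full code.

N = 1e5 is the right value, the one the original asker used, isn't it?

Same here: with a slightly faster processor the results are about the same, though slightly lower. But the second version looks better, because `broadcast` avoids getindex/setindex (try `@profile` on both versions), and the MATLAB implementation probably does the same.

`bsxfun` is also multithreaded. If you want to know what effect that has, compare against `matlab -singleCompThread`. You can now use `SharedArray`s to implement multithreaded algorithms, and more general threading is in progress.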
