FastChain vs. GPU in DiffEqFlux

machine-learning neural-network julia

To train a model on the GPU, I am using

dudt = Chain(Dense(3,100,tanh),
    Dense(100,3)) |> gpu

versus, for CPU training,

dudt = FastChain(FastDense(3,100,tanh),
    FastDense(100,3))

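Over 1000 iterations, FastChain is orders of magnitude faster than running on the GPU (a Tesla K40c). Is this expected behavior? If not, I may be doing something wrong in my GPU implementation. An MWE of the GPU implementation follows: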

using DiffEqFlux, OrdinaryDiffEq, Flux, CuArrays

# Lorenz system, written in-place
function lorenz(du,u,p,t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du[1] = σ*(u[2]-u[1])
    du[2] = u[1]*(ρ-u[3]) - u[2]
    du[3] = u[1]*u[2] - β*u[3]
    return
end
u0 = Float32[1.0,0.0,0.0]
tspan = (0.0,1.0)
para = [10.0,28.0,8/3]
prob = ODEProblem(lorenz, u0, tspan, para)
t = range(tspan[1],tspan[2],length=101)
# Generate the training data on the CPU, then move it to the GPU
ode_data = Array(solve(prob,Tsit5(),saveat=t))
ode_data = cu(ode_data)

u0train = [1.0,0.0,0.0] |> gpu
tspantrain = (0.0,1.0)
ttrain = range(tspantrain[1],tspantrain[2],length=101)
dudt = Chain(Dense(3,100,tanh),
    Dense(100,3)) |> gpu
n_ode = NeuralODE(dudt,tspantrain,Tsit5(),saveat=ttrain)

function predict_n_ode(p)
    n_ode(u0train,p)
end

function loss_n_ode(p)
    pred = predict_n_ode(p) |> gpu
    loss = sum(abs2, pred .- ode_data)
    loss,pred
end

# `cb` was not defined in the original post; this minimal callback is an assumption.
# Returning false tells sciml_train to keep iterating.
cb = (p, loss, pred) -> false

res1 = DiffEqFlux.sciml_train(loss_n_ode, n_ode.p, ADAM(0.01), cb=cb, maxiters = 1000)

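That model is too small for GPU parallelism to really make a difference. The neural network is essentially three matrix-vector products: 100x3, 100x100, and 3x100. The only one whose kernel probably comes close to breaking even is the middle one, where a 100x100 matrix is multiplied by a length-100 vector.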
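To get a rough sense of how little arithmetic is involved, here is a minimal sketch that counts floating-point operations for the three matrix-vector products listed above. The helper name `matvec_flops` is hypothetical (not part of any library), and the count assumes one multiply and one add per matrix entry.

```julia
# Hypothetical helper: floating-point operations for an m×n matrix
# times a length-n vector (one multiply and one add per matrix entry).
matvec_flops(m, n) = 2 * m * n

# The three sizes mentioned in the answer
sizes = [(100, 3), (100, 100), (3, 100)]

for (m, n) in sizes
    println("$(m)x$(n): ", matvec_flops(m, n), " flops")
end

total = sum(matvec_flops(m, n) for (m, n) in sizes)
println("total: ", total, " flops")
```

A few tens of thousands of operations per evaluation is very little work for a GPU, so the fixed cost of each call is likely to matter more than the arithmetic itself.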

For example, on my machine:

using BenchmarkTools, CuArrays
A = rand(100,100); x = rand(100);
@btime A*x; # 56.299 μs (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.499 μs (6 allocations: 160 bytes)

A = rand(100,3); x = rand(3);
@btime A*x; # 251.695 ns (1 allocation: 896 bytes)
gA = cu(A); gx = cu(x)
@btime gA*gx; # 12.212 μs (6 allocations: 160 bytes)

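So while a speedup on the largest operation does exist, it is not enough to overcome the slowdown from putting the other small operations on the GPU. This is because GPUs have a high floor (on my machine, around 12 μs per call), so you have to make sure your problem is large enough for the GPU to really make sense. Generally, machine learning benefits from GPUs because it is dominated by large matrix multiplications, with layer sizes in the tens of thousands.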
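One way to check whether a problem is "large enough" is to time the same operation across a range of sizes. The sketch below does this on the CPU only, using Base's `@elapsed`; the helper `min_time` and the chosen sizes are assumptions for illustration, not part of the original answer.

```julia
# Hypothetical helper: minimum wall-clock time (in seconds) of `f()` over `reps` runs.
function min_time(f; reps = 20)
    f()  # warm-up run so that compilation is not timed
    best = Inf
    for _ in 1:reps
        best = min(best, @elapsed f())
    end
    return best
end

# Time a square matrix-vector product at several sizes (CPU arrays only).
for n in (3, 100, 1_000, 2_000)
    A = rand(Float32, n, n)
    x = rand(Float32, n)
    t = min_time(() -> A * x)
    println("n = $n: ", round(t * 1e6, digits = 3), " μs on the CPU")
end
```

Running the same loop on arrays moved to the GPU with `cu` (and synchronizing the device before the clock stops) should show whether the GPU time stays near its fixed floor while the CPU time grows with `n`. The crossover point depends on the hardware, so it is worth measuring rather than assuming.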