Performance issue when copying data to the GPU (Python, GPU, Theano, deep learning)

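I ran into some performance issues while trying to train a deep convolutional neural network with Theano and Lasagne. I ran some experiments to find out where they come from, and found that loading batches of images from main memory to the GPU takes a very long time. Here is a simple example that illustrates the problem. It just times the evaluation of a Theano function that returns its input unchanged, for batch sizes of 1, 2, 4, 8, 16, ... I am working with RGB images of size 448x448.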

import numpy as np
import theano
import theano.tensor as T
import time

var = T.ftensor4('inputs')
f = theano.function([var], var)

for batchsize in [2**i for i in range(6)]:
    X = np.zeros((batchsize,3,448,448), dtype=np.float32)
    print "Batchsize", batchsize
    times = []
    start = time.time()
    for i in range(1000):
        f(X)
        times.append(time.time()-start)
        start = time.time()
    print "-> Function evaluation takes:", np.mean(times), "+/-", np.std(times), "sec"

My results are as follows:

Batchsize 1
-> Function evaluation takes: 0.000177580833435 +/- 2.78762612138e-05 sec
Batchsize 2
-> Function evaluation takes: 0.000321553707123 +/- 2.4221262933e-05 sec
Batchsize 4
-> Function evaluation takes: 0.000669012069702 +/- 0.000896798280943 sec
Batchsize 8
-> Function evaluation takes: 0.00137474012375 +/- 0.0032982626882 sec
Batchsize 16
-> Function evaluation takes: 0.176659427643 +/- 0.0330068803715 sec
Batchsize 32
-> Function evaluation takes: 0.356572513342 +/- 0.074931685704 sec

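Note the increase by a factor of roughly 100 when the batch size goes from 8 to 16. Is this normal, or do I have some technical problem? If so, do you have any idea where it might come from? Thanks for any help. It would also help if you ran the snippet and reported what you see.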

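Edit: Daniel Renshaw pointed out that this probably has nothing to do with host-to-GPU copying. Any other ideas about where the problem might come from? Some more information: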

The Theano debugprint of the function reads:

DeepCopyOp [@A] 'inputs'   0
 |inputs [@B]

Output of the Theano profiler:

Function profiling                                                      
================== 
Message: theano_test.py:14
Time in 6000 calls to Function.__call__: 3.711728e+03s
Time in Function.fn.__call__: 3.711528e+03s (99.995%)                       
Time in thunks: 3.711491e+03s (99.994%)
Total compile time: 6.542931e-01s
    Number of Apply nodes: 1
    Theano Optimizer time: 7.912159e-03s
        Theano validate time: 0.000000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 8.321500e-02s
        Import time 2.951717e-02s

Time in all call to theano.grad() 0.000000e+00s
Class 
---

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000       1   theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000        1   DeepCopyOp
... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
100.0%   100.0%     3711.491s       6.19e-01s   6000     0 DeepCopyOp(inputs)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

INFO (theano.gof.compilelock): Waiting for existing lock by process '3642' (I am process '22124')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/bal8rng/.theano/compiledir_Linux-3.16--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.10-64/lock_dir

THEANO_FLAGS:

floatX=float32,device=gpu,optimizer_include=conv_meta,mode=FAST_RUN,blas.ldflags="-L/usr/lib/openblas-base -lopenblas",device=gpu3,assert_no_cpu_op=raise

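Your computation is almost certainly not running on the GPU at all! As long as you are using the standard settings, Theano's optimizer is smart enough to see that no operation is actually being performed, so it does not add any "move data to the GPU" or "move data back from the GPU" operations to the compiled computation. You can see this by adding the following line after the line f = theano.function([var], var):

theano.printing.debugprint(f)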

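If you want to understand the overhead of moving data to and from the GPU, you may be better served by Theano's built-in profiler. Turn profiling on and then, in the output, look at the time spent in the GpuFromHost and HostFromGpu operations. Of course, this has to be done with a more meaningful computation, one in which the data really does need to be moved.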
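As a rough illustration of the profiling suggestion above, here is a minimal sketch. The function name profile_transfer and the choice of exp(...).sum() as the "more meaningful computation" are my own assumptions, picked only so that the optimizer cannot reduce the graph to a plain copy. The import is guarded so the snippet degrades gracefully where Theano is not installed.

```python
def profile_transfer(batchsize=8, n_calls=10):
    """Profile a small computation; return False if Theano/NumPy cannot be imported."""
    try:
        import numpy as np
        import theano
        import theano.tensor as T
    except Exception:  # Theano or NumPy missing or not importable here
        return False

    var = T.ftensor4('inputs')
    # exp(...).sum() is an arbitrary stand-in (an assumption, not from the
    # question); unlike returning the input unchanged, it is real work.
    f = theano.function([var], T.exp(var).sum(), profile=True)

    X = np.zeros((batchsize, 3, 448, 448), dtype=np.float32)
    for _ in range(n_calls):
        f(X)

    # Prints the per-op breakdown. When the graph really runs on the GPU,
    # GpuFromHost / HostFromGpu should show up among the listed ops.
    f.profile.summary()
    return True


if __name__ == "__main__":
    profile_transfer()
```

If neither GpuFromHost nor HostFromGpu appears in the summary, that would suggest the computation is still running on the CPU.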

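Still, it is strange that you get the results you do. Even if the computation is in fact running on the CPU, I would not expect to see such a step change as the batch size increases. If you do not continue to see the same behaviour when actually running a computation on the GPU, then perhaps it is not something you need to care about.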

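Incidentally, when I run your code on my server (where, despite device=gpu in the configuration, it actually runs on the CPU, as explained above), I do not get the same large step change; my time multipliers are 2.6, 1.9, 4.0, 3.9, 2.0 (i.e. going from batch size 1 to batch size 2 the time increased by a factor of 2.6, and so on).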
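For comparison, the same kind of multipliers can be computed from the mean times reported in the question. The sketch below uses only the numbers printed there; the helper name step_multipliers is made up for illustration, and the batch size in MiB is plain arithmetic on the 3x448x448 float32 shape, not a claim about the cause.

```python
# Mean times in seconds, copied from the question's output: (batch size, mean time).
mean_times = [
    (1, 0.000177580833435),
    (2, 0.000321553707123),
    (4, 0.000669012069702),
    (8, 0.00137474012375),
    (16, 0.176659427643),
    (32, 0.356572513342),
]

# channels * height * width * sizeof(float32)
BYTES_PER_IMAGE = 3 * 448 * 448 * 4


def step_multipliers(timings):
    """Ratio of each mean time to the previous one."""
    return [t_next / t_prev
            for (_, t_prev), (_, t_next) in zip(timings, timings[1:])]


multipliers = step_multipliers(mean_times)
for (b_prev, _), (b_next, _), m in zip(mean_times, mean_times[1:], multipliers):
    mib = b_next * BYTES_PER_IMAGE / (1024.0 * 1024.0)
    print("%2d -> %2d: x%.1f (batch of %d is %.1f MiB)" % (b_prev, b_next, m, b_next, mib))
```

Every step roughly doubles the time, as one would expect when the batch doubles, except the step from 8 to 16, which is the outlier the question asks about.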

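Regarding your updated question: do you really care? Wouldn't it be more productive to explore the performance characteristics of a more meaningful computation? Or have you already established that this behaviour also causes problems in more realistic scenarios?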