Keras: is the GPU slower than the CPU when it comes to prediction?
Tags: keras, theano, prediction, theano-cuda

I am using keras + theano to predict labels with a pretrained VGG model on an NVidia TK1.

In prediction, I get a faster prediction time from the CPU than from the GPU. If my memory is correct, prediction also involves a lot of repetitive number crunching, so I don't understand why the GPU would be slower here.

Does anyone have a good explanation?

GPU details line:
Using gpu device 0: GK20A (CNMeM is enabled with initial size: 75.0% of memory, cuDNN Version is too old. Update to v5, was 2000.)

Below are the profiling results for prediction:
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
39.5% 39.5% 0.019s 6.42e-03s C 3 3 theano.sandbox.cuda.blas.GpuDot22
24.8% 64.3% 0.012s 6.04e-03s C 2 2 theano.sandbox.cuda.blas.GpuCorrMM
16.4% 80.8% 0.008s 1.33e-03s C 6 6 theano.sandbox.cuda.basic_ops.GpuElemwise
7.8% 88.5% 0.004s 1.89e-03s C 2 2 theano.sandbox.cuda.blas.GpuDownsampleFactorMax
4.2% 92.7% 0.002s 2.03e-03s C 1 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
3.8% 96.4% 0.002s 4.57e-04s C 4 4 theano.sandbox.cuda.basic_ops.GpuContiguous
2.3% 98.8% 0.001s 5.66e-04s C 2 2 theano.sandbox.cuda.basic_ops.GpuFromHost
0.5% 99.3% 0.000s 2.51e-04s C 1 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
0.5% 99.8% 0.000s 2.39e-04s C 1 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.1% 99.8% 0.000s 1.37e-05s C 3 3 theano.sandbox.cuda.basic_ops.GpuReshape
0.0% 99.9% 0.000s 9.54e-06s C 2 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 99.9% 0.000s 4.35e-06s C 4 4 theano.tensor.elemwise.Elemwise
0.0% 99.9% 0.000s 5.01e-06s C 2 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.000s 3.26e-06s C 3 3 theano.compile.ops.Shape_i
0.0% 100.0% 0.000s 4.53e-06s C 2 2 theano.tensor.opt.MakeVector
0.0% 100.0% 0.000s 5.96e-06s C 1 1 theano.tensor.elemwise.Prod
0.0% 100.0% 0.000s 3.10e-06s C 1 1 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
39.5% 39.5% 0.019s 6.42e-03s C 3 3 GpuDot22
24.8% 64.3% 0.012s 6.04e-03s C 2 2 GpuCorrMM{valid, (1, 1)}
11.2% 75.5% 0.005s 1.36e-03s C 4 4 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
7.8% 83.3% 0.004s 1.89e-03s C 2 2 GpuDownsampleFactorMax{(2, 2),True}
4.2% 87.4% 0.002s 2.03e-03s C 1 1 GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
3.8% 91.2% 0.002s 4.57e-04s C 4 4 GpuContiguous
2.9% 94.1% 0.001s 1.43e-03s C 1 1 GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)]
2.3% 96.5% 0.001s 5.66e-04s C 2 2 GpuFromHost
2.3% 98.8% 0.001s 1.12e-03s C 1 1 GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)]
0.5% 99.3% 0.000s 2.51e-04s C 1 1 GpuSoftmaxWithBias
0.5% 99.8% 0.000s 2.39e-04s C 1 1 HostFromGpu
0.1% 99.8% 0.000s 1.60e-05s C 2 2 GpuReshape{4}
0.0% 99.9% 0.000s 9.54e-06s C 2 2 GpuSubtensor{::, ::, ::int64, ::int64}
0.0% 99.9% 0.000s 5.01e-06s C 2 2 GpuDimShuffle{x,0}
0.0% 99.9% 0.000s 4.53e-06s C 2 2 MakeVector{dtype='int64'}
0.0% 99.9% 0.000s 9.06e-06s C 1 1 GpuReshape{2}
0.0% 99.9% 0.000s 4.17e-06s C 2 2 Elemwise{Composite{((i0 + ((i1 + i2) // i3)) // i3)}}[(0, 2)]
0.0% 100.0% 0.000s 5.96e-06s C 1 1 Prod{acc_dtype=int64}
0.0% 100.0% 0.000s 5.96e-06s C 1 1 Elemwise{Cast{float32}}
0.0% 100.0% 0.000s 5.01e-06s C 1 1 Shape_i{0}
... (remaining 4 Ops account for 0.02%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
36.4% 36.4% 0.018s 1.77e-02s 1 33 GpuDot22(GpuReshape{2}.0, dense_5_W)
15.7% 52.1% 0.008s 7.64e-03s 1 18 GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
9.1% 61.2% 0.004s 4.44e-03s 1 28 GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
5.7% 66.9% 0.003s 2.76e-03s 1 25 GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
4.2% 71.0% 0.002s 2.03e-03s 1 20 GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
3.6% 74.6% 0.002s 1.74e-03s 1 34 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
3.2% 77.8% 0.002s 1.54e-03s 1 22 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
2.9% 80.7% 0.001s 1.43e-03s 1 23 GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)](GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}.1, CudaNdarrayConstant{[[[[ 0.80000001]]]]})
2.7% 83.4% 0.001s 1.29e-03s 1 36 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
2.3% 85.7% 0.001s 1.12e-03s 1 31 GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)](GpuFromHost.0, CudaNdarrayConstant{[[[[ 1.25]]]]}, GpuDownsampleFactorMax{(2, 2),True}.0, GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)].0)
2.2% 87.8% 0.001s 1.06e-03s 1 14 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
2.2% 90.0% 0.001s 1.06e-03s 1 35 GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_6_W)
2.1% 92.1% 0.001s 1.01e-03s 1 30 GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
2.0% 94.1% 0.001s 9.61e-04s 1 3 GpuFromHost(convolution2d_input_1)
1.8% 95.9% 0.001s 8.71e-04s 1 29 GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
1.6% 97.4% 0.001s 7.58e-04s 1 15 GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
1.0% 98.4% 0.000s 4.72e-04s 1 37 GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_7_W)
0.5% 98.9% 0.000s 2.51e-04s 1 38 GpuSoftmaxWithBias(GpuDot22.0, dense_7_b)
0.5% 99.4% 0.000s 2.39e-04s 1 39 HostFromGpu(GpuSoftmaxWithBias.0)
0.3% 99.8% 0.000s 1.70e-04s 1 19 GpuFromHost(Elemwise{Cast{float32}}.0)
... (remaining 20 Apply instances account for 0.25%(0.00s) of the runtime)
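As a rough way to read the Class section of the profile above, the rows can be tallied by kind of work. This is only a sketch: the percentages are copied from the profile (rows below 0.5% are left out), while the grouping and the names `GROUPS` and `share_by_group` are assumptions made for illustration and are not part of Theano or Keras.

```python
# (percent of profiled time, class name), copied from the Class section above.
CLASS_ROWS = [
    (39.5, "GpuDot22"),
    (24.8, "GpuCorrMM"),
    (16.4, "GpuElemwise"),
    (7.8, "GpuDownsampleFactorMax"),
    (4.2, "GPU_mrg_uniform"),
    (3.8, "GpuContiguous"),
    (2.3, "GpuFromHost"),
    (0.5, "GpuSoftmaxWithBias"),
    (0.5, "HostFromGpu"),
]

# Hypothetical grouping, chosen here only to make the profile easier to read.
GROUPS = {
    "matmul_and_conv": {"GpuDot22", "GpuCorrMM"},
    "host_device_transfer": {"GpuFromHost", "HostFromGpu"},
}


def share_by_group(rows, groups):
    """Sum the percentage of profiled time for each group; the rest goes to 'other'."""
    totals = {name: 0.0 for name in groups}
    totals["other"] = 0.0
    for percent, op_class in rows:
        for name, members in groups.items():
            if op_class in members:
                totals[name] += percent
                break
        else:
            # The class did not match any group.
            totals["other"] += percent
    return totals


shares = share_by_group(CLASS_ROWS, GROUPS)
for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
```

Under this grouping, matrix multiplication and convolution account for about 64.3% of the profiled time and host/device copies for about 2.8%, with the remaining listed classes making up about 32.7%.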